ERROR: cache lookup failed for relation 17442 (repost)
Hi All,
I've been getting these errors ("ERROR: cache lookup failed for
relation 17442") in my logs for a while now. It originally seemed
like a hardware problem, but now we're getting them pretty consistently
on a couple of servers. I've scaled down the schema to the one table and
the function involved, and included a code snippet that makes a bunch of
connections and loops around calling the same function. It usually
takes 100-2000 iterations before these messages start appearing in the
log. I've also included the original function; that one takes 10,000
iterations before the error starts showing. I should note that we've been
getting these errors since version 7; this is the first time they were
reproducible.
With the original function, the log messages were slightly different and
usually caused the server to reset:
i.e.
ERROR: type "t" already exists
ERROR: duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR: duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR: duplicate key violates unique constraint "pg_type_typname_nsp_index"
CONTEXT: SQL statement "create temp table tmp_children ( uniqid bigint,
memberid bigint, membertype varchar(50), ownerid smallint, tag
varchar(50), level int4 )"
PL/pgSQL function "fngetcompositeids2" line 14 at SQL statement
ERROR: duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR: cache lookup failed for type 2449707570
FATAL: cache lookup failed for type 2449707570
Environment info: Postgres v8, SuSE Linux with the latest kernel patches,
filesystem: reiserfs.
Please let me know if you need any more information. No data is needed,
just the schema included.
Thanks
Michael
Michael Guerin <guerin@rentec.com> writes:
I've been getting these errors ("ERROR: cache lookup failed for
relation 17442") in my logs for a while now.
Turning on verbose error logging shows that the error invariably comes
from RelationIsVisible():
2005-02-07 16:44:35 EST 2994 ERROR: XX000: cache lookup failed for relation 168051
2005-02-07 16:44:35 EST 2994 CONTEXT: SQL statement "SELECT EXISTS (SELECT relname FROM pg_class WHERE relname ~* ('^' || $1 || '$') AND relkind = 'r' and pg_table_is_visible(oid))"
PL/pgSQL function "table_exists" line 7 at if
PL/pgSQL function "fngetcompositeids2" line 13 at if
2005-02-07 16:44:35 EST 2994 LOCATION: RelationIsVisible, namespace.c:320
What is happening, I believe, is that the SELECT finds a row in pg_class
that has already been deleted (because the temp table in question
belonged to another backend that already dropped it). Under MVCC rules
the row is still visible to the SELECT. However, pg_table_is_visible
depends on backend internal catalog lookup functions, and those lookups
always follow SnapshotNow rules ... so there is a window in which the
pg_table_is_visible call can fail because the other backend's DROP
already committed.
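The race might be sketched as a two-session timeline (session labels and the relation OID are illustrative, not from the actual logs):

```sql
-- Illustrative timeline of the race:
--
-- Session A                                   Session B
-- ---------                                   ---------
-- BEGIN;
-- SELECT ... FROM pg_class
--   WHERE relname ~* ('^tmp_children$')
--   AND pg_table_is_visible(oid);
--                                             DROP TABLE tmp_children;  -- B's own temp table
--                                             COMMIT;
--
-- Under MVCC, A's snapshot still sees the pg_class row for B's
-- temp table, so the row qualifies and pg_table_is_visible(oid)
-- gets called on it.  But the syscache lookup inside that function
-- follows SnapshotNow, the DROP has already committed, and so:
--
-- ERROR: cache lookup failed for relation NNNNN
```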
The only quick-fix way I can see for making this safe is to modify
pg_table_is_visible and friends to silently return FALSE instead of
erroring out when they are handed an OID that doesn't seem to match any
existing catalog row. There was some talk of doing this once before:
http://archives.postgresql.org/pgsql-hackers/2002-11/msg00067.php
but we never got around to it, mainly because it seemed too likely to
mask mistakes.
A more general issue is that we now have a pile of "catalog inquiry"
functions that all make use of backend internal lookups and therefore
reflect SnapshotNow behavior to the user. This has bothered me for some
time, mainly because I'm worried that it could result in pg_dump failing
to emit a consistent snapshot --- eg, it could emit a CREATE VIEW
command that reflects a view change that occurred later than the start
of the dump. Defending against this looks horridly messy though :-(
... it almost seems to require duplicate sets of catalog lookup code.
In the meantime, you might want to think about identifying your own temp
tables by matching their relnamespace to current_schemas(), instead of
relying on pg_table_is_visible.
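A sketch of that alternative check (table name hardcoded here purely for illustration) might look like:

```sql
-- Look for a temp table named 'tmp_children' by matching its
-- relnamespace against the backend's own search path, instead of
-- calling pg_table_is_visible() on an OID that may already be gone.
-- current_schemas(true) includes the implicit schemas, notably this
-- backend's own pg_temp_NNN schema, so another backend's temp table
-- simply fails the namespace match rather than erroring out.
SELECT EXISTS (
    SELECT 1
      FROM pg_class c
     WHERE c.relname = 'tmp_children'
       AND c.relkind = 'r'
       AND c.relnamespace IN (
           SELECT oid FROM pg_namespace
            WHERE nspname = ANY (current_schemas(true))
       )
);
```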
regards, tom lane
Also ... the visibility issue doesn't seem to explain the other symptom
you reported:
With the original function, the log messages were slightly different and
usually caused the server to reset:
ERROR: duplicate key violates unique constraint
"pg_type_typname_nsp_index"
What was the "original function" exactly?
regards, tom lane
Michael Guerin wrote:
Hi All,
I've been getting these errors ("ERROR: cache lookup failed for
relation 17442") in my logs for a while now. [...]
Thanks
Michael
Michael,
The interesting thing about this bug is: we had the same thing on a
customer's machine some time ago. It actually occurred after a certain
script (nothing big) had been run the 100,001st time (maybe) on an empty
database. So this one does not seem to be related to the schema - it is
more or less random ...
The interesting thing is: we copied the data directory from the customer
and were not able to reproduce the same behaviour on a different machine.
The strange thing is: after doing a checkpoint and restarting the
database, the problem still occurred. Starting from the same binaries on a
different machine did not show that error ...
We stepped through it with gdb but could not find anything strange ...
Can you reliably reproduce the problem after an arbitrary number of
iterations on a different machine? We couldn't ...
Looking at the code: this is a null pointer caught by the system ...
Something seems to be corrupting memory ...
Hans
--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/660/816 40 77
www.cybertec.at, www.postgresql.at
On Mon, Feb 07, 2005 at 05:07:47PM -0500, Tom Lane wrote:
A more general issue is that we now have a pile of "catalog inquiry"
functions that all make use of backend internal lookups and therefore
reflect SnapshotNow behavior to the user. This has bothered me for some
time,
Can we pass a Snapshot as a parameter to the internal lookups? Or set a
global variable to indicate this is not really an internal lookup, and
reset it afterwards?
--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"The problem with the future is that it keeps turning into the present"
(Hobbes)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
On Mon, Feb 07, 2005 at 05:07:47PM -0500, Tom Lane wrote:
A more general issue is that we now have a pile of "catalog inquiry"
functions that all make use of backend internal lookups and therefore
reflect SnapshotNow behavior to the user.
Can we pass a Snapshot as a parameter to the internal lookups? Or set a
global variable to indicate this is not really an internal lookup, and
reset it afterwards?
The first seems awfully messy, and the second awfully dangerous.
regards, tom lane
Tom Lane wrote:
Also ... the visibility issue doesn't seem to explain the other symptom
you reported:
With the original function, the log messages were slightly different and
usually caused the server to reset:
ERROR: duplicate key violates unique constraint "pg_type_typname_nsp_index"
What was the "original function" exactly?
regards, tom lane
It's included in the schema, called uspgetcompositeids2_orig. The only
difference is that this function creates and drops the temp table
within the function; the newer function keeps the temp table around for
the life of the connection.
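For illustration only, the contrast between the two approaches might be sketched roughly like this (the column list is taken from the CREATE TEMP TABLE statement shown in the log context above; the function bodies and the EXISTS guard are hypothetical, not the actual schema):

```sql
-- Original style: create and drop the temp table on every call.  Each
-- call inserts and later deletes pg_class/pg_type rows, which is what
-- exercises the pg_type_typname_nsp_index and cache-lookup errors.
CREATE OR REPLACE FUNCTION fngetcompositeids2_orig() RETURNS void AS $$
BEGIN
    CREATE TEMP TABLE tmp_children ( uniqid bigint, memberid bigint,
        membertype varchar(50), ownerid smallint, tag varchar(50),
        level int4 );
    -- ... work with tmp_children ...
    DROP TABLE tmp_children;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Newer style: create the temp table once per connection and reuse it,
-- so the catalog churn happens only on the first call.
CREATE OR REPLACE FUNCTION fngetcompositeids2() RETURNS void AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM pg_class
                    WHERE relname = 'tmp_children'
                      AND pg_table_is_visible(oid)) THEN
        CREATE TEMP TABLE tmp_children ( uniqid bigint, memberid bigint,
            membertype varchar(50), ownerid smallint, tag varchar(50),
            level int4 );
    END IF;
    DELETE FROM tmp_children;
    -- ... work with tmp_children ...
    RETURN;
END;
$$ LANGUAGE plpgsql;
```

Note that the EXISTS guard in the second sketch still uses pg_table_is_visible(), which is exactly the call subject to the race described above.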
-michael
Michael Guerin <guerin@rentec.com> writes:
Tom Lane wrote:
What was the "original function" exactly?
It's included in the schema, called uspgetcompositeids2_orig. The only
difference is that this function creates and drops the temp table
within the function; the newer function keeps the temp table around for
the life of the connection.
So the appropriate test script looks like what?
regards, tom lane
Tom Lane wrote:
Michael Guerin <guerin@rentec.com> writes:
Tom Lane wrote:
What was the "original function" exactly?
It's included in the schema, called uspgetcompositeids2_orig. The only
difference is that this function creates and drops the temp table
within the function; the newer function keeps the temp table around for
the life of the connection.
So the appropriate test script looks like what?
regards, tom lane
I'll have to try to re-create this error; I thought the two were
related. I'll post it when I can replicate that error - so far I
haven't been able to duplicate it.
thanks
michael