Cache lookup failed for relation

Started by David Clymerabout 13 years ago11 messagesgeneral
Jump to latest
#1David Clymer
david.clymer@vistashare.com

I've been seeing the following error in one database of ours:

"cache lookup failed for relation 7640518"

The SQL that apparently triggers this is:

drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel queries,
and the tables are by nature, short lived. That said, they are not
temporary tables.

This is one of two very similar databases, and we are running the same
software (same version) on top of each. The databases are in different
versions of postgresql. Db #1 is postgresql 9.2.3, and db #2 (the one
exhibiting the above behavior) is postgresql 9.0.11.

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior. I've looked at the
caching code in postgresql, and though I cannot claim to have a thorough
understanding of how it works, I have a theory.

How the cache works (as I understand it):

When a cache lookup is performed, tuples are looked up by OID. The
cache contents are hashed into buckets. If an item is found in the cache,
it is promoted to the top of the bucket so that subsequent searches are
faster. If an item is not in the cache, it is looked up in the system
catalog, and an entry is inserted into the cache. If a lookup in the
catalog fails, a negative entry is added to the cache for the tuple.
Multiple entries can exist for the same tuple. The latest one is just
promoted to the top of the bucket, and the other gets aged out of the
cache, since it is never again accessed.

Theory:

Given that we have wrapped around our OID counter, it is possible to
have multiple entries in the cache for the same OID. If one relation is
deleted, and a negative entry inserted into the cache, attempts to look up
the other may erroneously produce a negative cache hit, yielding our "cache
lookup failed for relation" error.

Is this a possibility? Are there any other obvious explanation for this?
The results from google related to this error seem to point to catalog
corruption, or a postgres bug.

Any pointers/enlightenment would be appreciated.

-davidc

--
*David Clymer*
VistaShare
866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/&gt;

[image: Facebook] www.facebook.com/vistashare
[image: Twitter] www.twitter.com/vistashare

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Clymer (#1)
Re: Cache lookup failed for relation

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel queries,
and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior.

I don't believe that theory at all.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#2)
Re: Cache lookup failed for relation

2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel queries,
and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

we can see same behave in 9.1

when you try drop some tables in parallel sessions

Regards

Pavel Stehule

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior.

I don't believe that theory at all.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Péter Kovács
peter.dunay.kovacs@gmail.com
In reply to: Tom Lane (#2)
Re: Cache lookup failed for relation

I'm confused by the error message. Is a cache miss an error condition?

Thanks
Peter
On Feb 11, 2013 6:22 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Show quoted text

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel

queries,

and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior.

I don't believe that theory at all.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Péter Kovács (#4)
Re: Cache lookup failed for relation

=?UTF-8?B?UMOpdGVyIEtvdsOhY3M=?= <peter.dunay.kovacs@gmail.com> writes:

I'm confused by the error message. Is a cache miss an error condition?

Well, this isn't a "cache miss", it's more of a "there's no such OID in
the pg_class catalog" condition. Normally you see something more
user-friendly; but in the case of going to remove a cross-table linkage,
the code isn't expecting the other table to not be there, so you get a
pretty low-level error.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6David Clymer
david.clymer@vistashare.com
In reply to: Tom Lane (#2)
Re: Cache lookup failed for relation

On Mon, Feb 11, 2013 at 12:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

It appears that almost all instances reference a different OID.

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel

queries,

and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly?

Sorry, that's our application level terminology. As far as postgres is
concerned they are just individual queries running at the roughly same time.

If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

I don't think we are doing that, but it may be that two queries are
attempting to drop the same table "if exists". I'll have to look at that a
bit more.

The SERIALIZABLE isolation mode is being used in 9.0, and REPEATABLE READ
in 9.2, which should be the same thing, correct (eg. 9.0 serializable ~ 9.2
repeatable read)?

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior.

I don't believe that theory at all.

Good to know.

-davidc

--
*David Clymer*
VistaShare
866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/&gt;

[image: Facebook] www.facebook.com/vistashare
[image: Twitter] www.twitter.com/vistashare

#7David Clymer
david.clymer@vistashare.com
In reply to: Pavel Stehule (#3)
Re: Cache lookup failed for relation

On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:

2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel

queries,

and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

we can see same behave in 9.1

when you try drop some tables in parallel sessions

OK,
--
*David Clymer*
VistaShare
866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/&gt;

[image: Facebook] www.facebook.com/vistashare
[image: Twitter] www.twitter.com/vistashare

#8David Clymer
david.clymer@vistashare.com
In reply to: David Clymer (#7)
Re: Cache lookup failed for relation

On Mon, Feb 11, 2013 at 1:13 PM, David Clymer
<david.clymer@vistashare.com>wrote:

On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:

2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel

queries,

and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x. It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

we can see same behave in 9.1

when you try drop some tables in parallel sessions

OK, so perhaps the difference is purely due to the use of postgres < 9.2 on
one db.

-davidc

--
*David Clymer*
VistaShare
866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/&gt;

[image: Facebook] www.facebook.com/vistashare
[image: Twitter] www.twitter.com/vistashare

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: David Clymer (#8)
Re: Cache lookup failed for relation

2013/2/11 David Clymer <david.clymer@vistashare.com>

On Mon, Feb 11, 2013 at 1:13 PM, David Clymer <david.clymer@vistashare.com

wrote:

On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:

2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:

David Clymer <david.clymer@vistashare.com> writes:

I've been seeing the following error in one database of ours:
"cache lookup failed for relation 7640518"

Always the same OID, or does it change?

The SQL that apparently triggers this is:
drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel

queries,

and the tables are by nature, short lived. That said, they are not
temporary tables.

Hm ... what are the parallel queries exactly? If you're doing

something

like dropping both ends of a foreign-key linkage in parallel, I'd not

be

very astonished by an error like this, especially not in 9.0.x. It'd

be

basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table. (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

we can see same behave in 9.1

when you try drop some tables in parallel sessions

OK, so perhaps the difference is purely due to the use of postgres < 9.2
on one db.

yes, I have not 9.2 now, but on 9.3 you get user friendly message

NOTICE: table "foo" does not exist, skipping
DROP TABLE

Regards

Pavel

Show quoted text

-davidc

--
*David Clymer*
VistaShare
866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/&gt;

[image: Facebook] www.facebook.com/vistashare
[image: Twitter] www.twitter.com/vistashare

#10Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: David Clymer (#6)
Re: Cache lookup failed for relation

David Clymer <david.clymer@vistashare.com> wrote:

The SERIALIZABLE isolation mode is being used in 9.0, and
REPEATABLE READ in 9.2, which should be the same thing, correct
(eg. 9.0 serializable ~ 9.2 repeatable read)?

Correct.

In 9.0 SERIALIZABLE and REPEATABLE READ are exactly same.  In 9.1
and later REPEATABLE READ has not changed from 9.0, but
SERIALIZABLE has basically become REPEATABLE READ with extra
checking for serialization failures.

-Kevin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Péter Kovács
peter.dunay.kovacs@gmail.com
In reply to: Tom Lane (#5)
Re: Cache lookup failed for relation

Thank you, Tom!

Recently, as I wanted to drop a schema (PG 8.2), I had to wade through a
number of such messages and keep dropping rows in the pg_dependency table
with the OID specified in the messages. When I was finally able to drop the
schema, I wanted to drop the user (who previously owned the schema), but I
got the same message again. I can imagine that something may have gone
wrong in the schema at the application level, but how come I am getting the
same messages when trying to drop a user?

Thanks,

Peter

On Mon, Feb 11, 2013 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

=?UTF-8?B?UMOpdGVyIEtvdsOhY3M=?= <peter.dunay.kovacs@gmail.com> writes:

I'm confused by the error message. Is a cache miss an error condition?

Well, this isn't a "cache miss", it's more of a "there's no such OID in
the pg_class catalog" condition. Normally you see something more
user-friendly; but in the case of going to remove a cross-table linkage,
the code isn't expecting the other table to not be there, so you get a
pretty low-level error.

regards, tom lane