7.2.3 vacuum bug

Started by Rod Taylorabout 23 years ago14 messages
#1Rod Taylor
rbt@rbt.ca

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I've been unable to come up with a test case that will cause the
problem, seems to be timing related. The queries that are currently
running when these errors occur do a lot or work with temp tables that
are frequently truncated.

--
Rod Taylor

#2Neil Conway
neilc@samurai.com
In reply to: Rod Taylor (#1)
Re: 7.2.3 vacuum bug

Rod Taylor <rbt@rbt.ca> writes:

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I was going to report a similar error that arises in a different
situation:

client 1:

CREATE TABLE a (b int);
BEGIN;
DROP TABLE a;
-- wait

client 2:

SELECT * FROM a;

client 1:

COMMIT;

Now, client 2 will receive "RelationClearRelation: relation 25172
deleted while still in use", rather than "Relation "a" does not
exist", as you might expect. Not sure if it's the same bug, or just a
different problem...

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#3Rod Taylor
rbt@rbt.ca
In reply to: Neil Conway (#2)
Re: 7.2.3 vacuum bug

On Wed, 2002-10-30 at 15:38, Neil Conway wrote:

Rod Taylor <rbt@rbt.ca> writes:

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I was going to report a similar error that arises in a different
situation:

Probably a different look at the same problem.

--
Rod Taylor

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#2)
Re: 7.2.3 vacuum bug

Neil Conway <neilc@samurai.com> writes:

client 1:

CREATE TABLE a (b int);
BEGIN;
DROP TABLE a;
-- wait

client 2:

SELECT * FROM a;

client 1:

COMMIT;

Now, client 2 will receive "RelationClearRelation: relation 25172
deleted while still in use", rather than "Relation "a" does not
exist", as you might expect.

But relation "a" *does* exist at the start of client 2's operation.
While I'm not here to defend the exact phrasing of this error message,
it does seem to me that it's appropriate to give a different error
message than what appears when the table wasn't found at all.

An example of why the two cases shouldn't be folded together: suppose
that client 2's schema search path is "myschema, public", and that
client 1 creates/drops myschema.a while there is also a public.a.
client 2 will locate myschema.a as the meaning of "a", and one way or
another it is going to error out when myschema.a gets dropped from
underneath it --- it will not (and shouldn't IMHO) go back and repeat
the schema search to find public.a. But a user who gets a "Relation "a"
does not exist" error message in such a scenario would be justifiably
confused.

regards, tom lane

#5Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#4)
Re: 7.2.3 vacuum bug

Tom Lane <tgl@sss.pgh.pa.us> writes:

But relation "a" *does* exist at the start of client 2's operation.
While I'm not here to defend the exact phrasing of this error message,
it does seem to me that it's appropriate to give a different error
message than what appears when the table wasn't found at all.

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rod Taylor (#1)
Re: 7.2.3 vacuum bug

Rod Taylor <rbt@rbt.ca> writes:

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I've been unable to come up with a test case that will cause the
problem, seems to be timing related. The queries that are currently
running when these errors occur do a lot or work with temp tables that
are frequently truncated.

Hm. vacuum.c tries to avoid this class of problem:

/*
* Race condition -- if the pg_class tuple has gone away since the
* last time we saw it, we don't need to vacuum it.
*/
if (!SearchSysCacheExists(RELOID,
ObjectIdGetDatum(relid),
0, 0, 0))
{
CommitTransactionCommand(true);
return true; /* okay 'cause no data there */
}

...

onerel = relation_open(relid, lmode);

but on reflection it's clear that this doesn't really prevent a race
condition. If the table is already exclusive-locked by a DROP TABLE
that hasn't committed yet (eg, the implicit DROP that happens when temp
tables are cleared out at backend exit), then the syscache lookup will
go fine, but the relation_open() routine blocks waiting for lock and
eventually fails.

What would probably work better is to first lock the relation OID,
then see if we can open the relation or not.

Thinking further, it's really kinda bogus that LockRelation() works on
an already-opened Relation; if possible we should acquire the lock
before attempting to create a relcache entry. (We only need to know the
OID and the relisshared status before we can make a locktag, so it'd be
possible to acquire the lock using only the contents of the pg_class row.)
Not sure how much code restructuring might be involved to make this
happen, but it'd be worth thinking about for 7.4.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#5)
Re: 7.2.3 vacuum bug

Neil Conway <neilc@samurai.com> writes:

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Got a suggestion?

regards, tom lane

#8scott.marlowe
scott.marlowe@ihs.com
In reply to: Tom Lane (#7)
Re: 7.2.3 vacuum bug

On Thu, 31 Oct 2002, Tom Lane wrote:

Neil Conway <neilc@samurai.com> writes:

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Got a suggestion?

Change: RelationClearRelation: relation 25172 deleted while still in use
to: RelationClearRelation: a relation (id: 25172) was deleted while still
in use

#9Rod Taylor
rbt@rbt.ca
In reply to: scott.marlowe (#8)
Re: 7.2.3 vacuum bug

Found another:

ERROR: cannot find attribute 2 of relation pg_temp_12100_0

On Thu, 2002-10-31 at 11:33, scott.marlowe wrote:

On Thu, 31 Oct 2002, Tom Lane wrote:

Neil Conway <neilc@samurai.com> writes:

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Got a suggestion?

Change: RelationClearRelation: relation 25172 deleted while still in use
to: RelationClearRelation: a relation (id: 25172) was deleted while still
in use

--
Rod Taylor

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rod Taylor (#9)
Re: 7.2.3 vacuum bug

Rod Taylor <rbt@rbt.ca> writes:

Found another:
ERROR: cannot find attribute 2 of relation pg_temp_12100_0

Can you reproduce that?

It could be that this just represents someone's temp table deletion
committing while VACUUM is partway through trying to build a relcache
entry to open the relation. If so, it is only another manifestation
of the should-lock-before-relation-open problem.

regards, tom lane

#11Rod Taylor
rbt@rbt.ca
In reply to: Tom Lane (#10)
Re: 7.2.3 vacuum bug

On Thu, 2002-10-31 at 13:03, Tom Lane wrote:

Rod Taylor <rbt@rbt.ca> writes:

Found another:
ERROR: cannot find attribute 2 of relation pg_temp_12100_0

Can you reproduce that?

It could be that this just represents someone's temp table deletion
committing while VACUUM is partway through trying to build a relcache
entry to open the relation. If so, it is only another manifestation
of the should-lock-before-relation-open problem.

Yes, but not easily (very timing dependent), takes a lot of worker
processes to throw it. So it's likely a part of the locking issue.

--
Rod Taylor

#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#6)
Re: 7.2.3 vacuum bug

Is this a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:

Rod Taylor <rbt@rbt.ca> writes:

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I've been unable to come up with a test case that will cause the
problem, seems to be timing related. The queries that are currently
running when these errors occur do a lot or work with temp tables that
are frequently truncated.

Hm. vacuum.c tries to avoid this class of problem:

/*
* Race condition -- if the pg_class tuple has gone away since the
* last time we saw it, we don't need to vacuum it.
*/
if (!SearchSysCacheExists(RELOID,
ObjectIdGetDatum(relid),
0, 0, 0))
{
CommitTransactionCommand(true);
return true; /* okay 'cause no data there */
}

...

onerel = relation_open(relid, lmode);

but on reflection it's clear that this doesn't really prevent a race
condition. If the table is already exclusive-locked by a DROP TABLE
that hasn't committed yet (eg, the implicit DROP that happens when temp
tables are cleared out at backend exit), then the syscache lookup will
go fine, but the relation_open() routine blocks waiting for lock and
eventually fails.

What would probably work better is to first lock the relation OID,
then see if we can open the relation or not.

Thinking further, it's really kinda bogus that LockRelation() works on
an already-opened Relation; if possible we should acquire the lock
before attempting to create a relcache entry. (We only need to know the
OID and the relisshared status before we can make a locktag, so it'd be
possible to acquire the lock using only the contents of the pg_class row.)
Not sure how much code restructuring might be involved to make this
happen, but it'd be worth thinking about for 7.4.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#12)
Re: 7.2.3 vacuum bug

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Is this a TODO?

Yes. Maybe

* Acquire lock on a relation before building a relcache entry for it

I'm not quite sure yet how this should interact with the case where
you already have a relcache entry, but certainly the existing behavior
of "build the whole entry and then acquire lock" is not good.

regards, tom lane

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#13)
Re: 7.2.3 vacuum bug

Added to TODO.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Is this a TODO?

Yes. Maybe

* Acquire lock on a relation before building a relcache entry for it

I'm not quite sure yet how this should interact with the case where
you already have a relcache entry, but certainly the existing behavior
of "build the whole entry and then acquire lock" is not good.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073