ERROR: btree scan list trashed ??

Started by Adriaan Joubert over 26 years ago, 11 messages
#1Adriaan Joubert
a.joubert@albourne.com

I've been getting the following error out of the backend (followed by a
general demise of the system)

ERROR: btree scan list trashed; can't find 0x1401e38c0

Anybody know what this means? I've spent a day trying to figure out what
I'm doing wrong, but this gives me no clues. It seems to happen
completely intermittently and seems to depend on a combination of
clients doing something at the same time.

Any help greatly appreciated!

Adriaan

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adriaan Joubert (#1)
Re: [HACKERS] ERROR: btree scan list trashed ??

Adriaan Joubert <a.joubert@albourne.com> writes:

I've been getting the following error out of the backend (followed by a
general demise of the system)
ERROR: btree scan list trashed; can't find 0x1401e38c0
Anybody know what this means? I've spent a day trying to figure out what
I'm doing wrong, but this gives me no clues.

I doubt you are doing anything "wrong" ... you're just getting burnt
by some internal bug.

What Postgres version are you using, and on what platform? If it's
anything older than 6.5.1, an upgrade would probably be a good idea.

If you are seeing the problem in 6.5.1, then we need to try to fix it.
A reproducible test case would help a lot in finding the bug.

It seems to happen completely intermittently and seems to depend on a
combination of clients doing something at the same time.

Ugh. Getting a reproducible test case might be hard...

BTW, another possible stopgap is to rebuild (drop and re-create) all
your indexes. If the problem is being triggered by a corrupted index
then that should make it go away, at least until the next time the
index gets corrupted.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: [HACKERS] ERROR: btree scan list trashed ??

Adriaan Joubert <a.joubert@albourne.com> writes:

What Postgres version are you using, and on what platform? If it's
anything older than 6.5.1, an upgrade would probably be a good idea.

Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
(Digital Unix, compiled with cc).

Alpha, eh? We have some known porting problems on 64-bit architectures,
and I wonder whether this is one of them. Going to be hard to nail that
down until we can reproduce the error, however.

After some digging around in backend/access/nbtree/nbtscan.c, which is
producing the error, I notice that the routine in question is searching
a list that does not get cleared properly at transaction abort. It's
not clear that that's the cause of the error message, though. What
I suggest at this point is that you pay more attention to what happens
just before the transaction in which you get the "btree scan list
trashed" message. In particular, are there any commands that abort
with errors a little bit earlier in the same backend? It might take
the combination of an error in a btree-index-using command and then
another btree index access to provoke the "trashed" symptom.

regards, tom lane

#4Adriaan Joubert
a.joubert@albourne.com
In reply to: Tom Lane (#3)
Re: [HACKERS] ERROR: btree scan list trashed ??

Tom Lane wrote:

Adriaan Joubert <a.joubert@albourne.com> writes:

What Postgres version are you using, and on what platform? If it's
anything older than 6.5.1, an upgrade would probably be a good idea.

Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
(Digital Unix, compiled with cc).

Alpha, eh? We have some known porting problems on 64-bit architectures,
and I wonder whether this is one of them. Going to be hard to nail that
down until we can reproduce the error, however.

After some digging around in backend/access/nbtree/nbtscan.c, which is
producing the error, I notice that the routine in question is searching
a list that does not get cleared properly at transaction abort. It's
not clear that that's the cause of the error message, though. What
I suggest at this point is that you pay more attention to what happens
just before the transaction in which you get the "btree scan list
trashed" message. In particular, are there any commands that abort
with errors a little bit earlier in the same backend? It might take
the combination of an error in a btree-index-using command and then
another btree index access to provoke the "trashed" symptom.

That may be it. I have some PL routines that raise an exception if an
operation could lead to an inconsistency in my database tables. This is
not really an error, but I do want to abort the transaction in that
case, so I raise an exception. I'll continue trying to nail it down and
then I can look with the debugger what happens.

BTW, I've installed 6.5.1 and still have the same problems. Vacuuming
hung up everything, and I had to shut the whole thing down and restart
it to get it working again. Dropping the indices and rebuilding them all
fixed the problem.

How difficult is it to clear the list at transaction abort? Is this
something I could patch and try out?

Thanks a lot for looking at this, much appreciated!

Adriaan

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adriaan Joubert (#4)
Re: [HACKERS] ERROR: btree scan list trashed ??

Adriaan Joubert <a.joubert@albourne.com> writes:

After some digging around in backend/access/nbtree/nbtscan.c, which is
producing the error, I notice that the routine in question is searching
a list that does not get cleared properly at transaction abort. It's
not clear that that's the cause of the error message, though.

BTW, I've installed 6.5.1 and still have the same problems.

No surprise, really.

Vacuuming hung up everything, and I had to shut the whole thing down
and restart it to get it working again. Dropping the indices and
rebuilding them all fixed the problem.

Hmm, that suggests that your indexes are actually getting corrupted.

How difficult is it to clear the list at transaction abort? Is this
something I could patch and try out?

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort. I don't see how this would *directly* cause the
observed symptom, but failing to do it should lead to misbehavior in
_bt_adjscans() during later transactions, so it might be related
somehow. If you want to patch it, make a subroutine that clears the
variable (no need to free the list; since it's palloc'd it'll go
away anyway) and call it from transaction cleanup in
backend/access/transam/xact.c.

regards, tom lane
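
The patch outlined in #5 can be illustrated with a small standalone C sketch. This is a toy model, not the nbtscan.c code: every name in it (ToyScanNode, toy_scans, toy_regscan, toy_dropscan, toy_reset_scans_at_abort, toy_scan_count) is hypothetical, and a fixed-size pool stands in for palloc'd memory that goes away at abort. It only shows the shape of the idea: a backend-local list of open scans, a lookup that complains when an entry is missing, and a reset routine that transaction cleanup would call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for one entry in a backend-local list of open index scans. */
typedef struct ToyScanNode {
    void *scan;                   /* opaque handle identifying the scan */
    struct ToyScanNode *next;
} ToyScanNode;

/* Toy arena: models memory that is released wholesale at abort (assumption). */
#define TOY_POOL_SIZE 16
static ToyScanNode toy_pool[TOY_POOL_SIZE];
static int toy_pool_used = 0;

/* Plays the role of the BTScans list head; the real variable is not shown here. */
static ToyScanNode *toy_scans = NULL;

/* Register a scan by pushing it onto the front of the list. */
int toy_regscan(void *scan)
{
    assert(scan != NULL);
    if (toy_pool_used >= TOY_POOL_SIZE)
        return -1;                /* toy pool exhausted */
    ToyScanNode *node = &toy_pool[toy_pool_used++];
    node->scan = scan;
    node->next = toy_scans;
    toy_scans = node;
    return 0;
}

/* Unlink a scan; returns -1 if it is not in the list (the "trashed" case). */
int toy_dropscan(void *scan)
{
    ToyScanNode **link = &toy_scans;
    while (*link != NULL && (*link)->scan != scan)
        link = &(*link)->next;
    if (*link == NULL) {
        fprintf(stderr, "toy scan list trashed; can't find %p\n", scan);
        return -1;
    }
    *link = (*link)->next;
    return 0;
}

/* The subroutine the patch calls for: forget the list at transaction abort.
 * Nothing is freed node by node; the backing memory is reclaimed in bulk. */
void toy_reset_scans_at_abort(void)
{
    toy_scans = NULL;
    toy_pool_used = 0;
}

/* Count registered scans; only used to inspect the toy list. */
int toy_scan_count(void)
{
    int count = 0;
    for (ToyScanNode *node = toy_scans; node != NULL; node = node->next)
        count++;
    return count;
}
```

In this model, skipping toy_reset_scans_at_abort() after an abort would leave toy_scans pointing at entries whose memory has already been recycled, so a later walk over the list would read stale nodes. That is the kind of later misbehavior #5 warns about, though as noted there it does not by itself explain the reported error.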

#6Vadim Mikheev
vadim@krs.ru
In reply to: Tom Lane (#5)
Re: [HACKERS] ERROR: btree scan list trashed ??

Tom Lane wrote:

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort. I don't see how this would *directly* cause the
observed symptom, but failing to do it should lead to misbehavior in
_bt_adjscans() during later transactions, so it might be related
somehow. If you want to patch it, make a subroutine that clears the
variable (no need to free the list; since it's palloc'd it'll go
away anyway) and call it from transaction cleanup in
backend/access/transam/xact.c.

This should be fixed in CVS too.

Vadim

#7The Hermit Hacker
scrappy@hub.org
In reply to: Vadim Mikheev (#6)
Re: [HACKERS] ERROR: btree scan list trashed ??

On Thu, 5 Aug 1999, Vadim Mikheev wrote:

Tom Lane wrote:

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort. I don't see how this would *directly* cause the
observed symptom, but failing to do it should lead to misbehavior in
_bt_adjscans() during later transactions, so it might be related
somehow. If you want to patch it, make a subroutine that clears the
variable (no need to free the list; since it's palloc'd it'll go
away anyway) and call it from transaction cleanup in
backend/access/transam/xact.c.

This should be fixed in CVS too.

Is this something that can be easily back-patched for v6.5.2?

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#7)
Re: [HACKERS] ERROR: btree scan list trashed ??

The Hermit Hacker <scrappy@hub.org> writes:

On Thu, 5 Aug 1999, Vadim Mikheev wrote:

Tom Lane wrote:

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort.

This should be fixed in CVS too.

Yes, absolutely.

Is this something that can be easily back-patched for v6.5.2?

I will patch this in both current and REL6_5. But, although this
is clearly a bug, I am not at all convinced that it explains
Adriaan's problem. I think more creepie-crawlies lurk nearby :-(

regards, tom lane

#9Adriaan Joubert
a.joubert@albourne.com
In reply to: Tom Lane (#8)
Re: [HACKERS] ERROR: btree scan list trashed ??

I will patch this in both current and REL6_5. But, although this
is clearly a bug, I am not at all convinced that it explains
Adriaan's problem. I think more creepie-crawlies lurk nearby :-(

Hmm, I made the changes and I only got three errors out of the system
today. So it is not fixed, although perhaps improved (or was I just
lucky?). I've been locking tables more restrictively, so this may have
helped as well. I definitely think this has something to do with
concurrent accesses to the same index. It always seems to start
happening as the tables start getting updates more rapidly.

Another thought: an index on a table that gets updated sometimes through
a PL trigger is an index on a user-defined type (the bitmask type I
posted a while ago). Could this have something to do with a btree index
on a user-defined type? I'll drop that index and see whether it makes a
difference. All indexes on other tables that are touched are int4.

Thanks for all the help, Tom!

Adriaan

#10Adriaan Joubert
a.joubert@albourne.com
In reply to: Tom Lane (#8)
Re: [HACKERS] ERROR: btree scan list trashed ??

OK, I've dropped my user-defined type index and it hasn't made any
difference. I've had quite a few of the following again:

UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
ERROR: btree scan list trashed; can't find 0x1401744a0

I've got a lot of logging switched on, and these do not seem to be
preceded by errors. Since patching it the system seems to recover ok, so
I'm wondering whether this could be a caching issue. I think I will just
lock all tables in their entirety now, and see whether that fixes it
(there goes my MVCC performance boost 8-(). I still think it has
something to do with concurrent access to the indices.

If anybody has any more suggestions of what I could try, please let me
know.

Cheers,

Adriaan

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adriaan Joubert (#10)
Re: [HACKERS] ERROR: btree scan list trashed ??

Adriaan Joubert <a.joubert@albourne.com> writes:

OK, I've dropped my user-defined type index and it hasn't made any
difference. I've had quite a few of the following again:

UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
ERROR: btree scan list trashed; can't find 0x1401744a0

I've got a lot of logging switched on, and these do not seem to be
preceded by errors. Since patching it the system seems to recover ok, so
I'm wondering whether this could be a caching issue. I think I will just
lock all tables in their entirety now, and see whether that fixes it
(there goes my MVCC performance boost 8-(). I still think it has
something to do with concurrent access to the indices.

Let us know whether going to full locking makes any difference.

I am currently wondering whether this is a porting issue (64-bit vs
32-bit pointers). If it only happens on 64-bit platforms, that would
explain why we haven't seen many similar reports. Unfortunately,
that theory provides little useful guidance about where to look :-(

regards, tom lane