PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Started by Valentin Bogdanov over 17 years ago | 11 messages | bugs
#1 Valentin Bogdanov
valiouk@yahoo.co.uk

Hi All,

I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

Does anyone know what this error means and how to recover from it?

Any help will be very much appreciated.

Thanks,
Val

P.S. Here is the complete output I get

Jan 5 10:36:29 db2 postgres[17111]: [2-1] LOG: database system was interrupted while in recovery at 2009-01-05 10:24:37 GMT
Jan 5 10:36:29 db2 postgres[17111]: [2-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
Jan 5 10:36:29 db2 postgres[17111]: [3-1] LOG: checkpoint record is at 122/D080660
Jan 5 10:36:29 db2 postgres[17111]: [4-1] LOG: redo record is at 122/D00060C; undo record is at 0/0; shutdown FALSE
Jan 5 10:36:29 db2 postgres[17111]: [5-1] LOG: next transaction ID: 2664007622; next OID: 521067
Jan 5 10:36:29 db2 postgres[17111]: [6-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0
Jan 5 10:36:29 db2 postgres[17111]: [7-1] LOG: database system was not properly shut down; automatic recovery in progress
Jan 5 10:36:29 db2 postgres[17111]: [8-1] LOG: redo starts at 122/D00060C
Jan 5 10:36:29 db2 postgres[17112]: [2-1] LOG: incomplete startup packet
Jan 5 10:36:29 db2 postgres[17111]: [9-1] LOG: record with zero length at 122/E914B48
Jan 5 10:36:29 db2 postgres[17111]: [10-1] LOG: redo done at 122/E914B20
Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
Jan 5 10:36:29 db2 postgres[17110]: [2-1] LOG: startup process (PID 17111) was terminated by signal 6
Jan 5 10:36:29 db2 postgres[17110]: [3-1] LOG: aborting startup due to startup process failure
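
For anyone triaging similar output, the PANIC line can be picked apart mechanically. The following is a minimal sketch, not part of PostgreSQL: `parse_split_panic` is a hypothetical helper that assumes the syslog layout shown above and only extracts the quoted relation name and the two block numbers.

```python
import re

# Matches the btree PANIC message as it appears in the log above.
PANIC_RE = re.compile(
    r'PANIC:\s+failed to re-find parent key in "(?P<rel>[^"]+)" '
    r'for split pages (?P<first>\d+)/(?P<second>\d+)'
)


def parse_split_panic(line):
    """Return (relation, first_block, second_block), or None if no match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return m.group("rel"), int(m.group("first")), int(m.group("second"))


sample = (
    'Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find '
    'parent key in "100924" for split pages 1606/1673'
)
print(parse_split_panic(sample))
```

The relation is returned as a string because the log prints it in quotes; whether it is a name or a file number depends on the server version and is not decided here.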

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Valentin Bogdanov (#1)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

val <valiouk@yahoo.co.uk> writes:

I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Hmm ... I wonder if this is telling us that our patch here was
incomplete?
http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php

At the time we thought this failure could only occur during _bt_pagedel
but you have evidently got a case where a split is failing. It might
just be garden-variety index corruption, or it might be a real bug.

Is this database sufficiently small and non-proprietary that you could
send me a filesystem copy of it (a tarball of all of $PGDATA including
the WAL files)?

regards, tom lane
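
A filesystem copy of the kind requested here could be produced along the following lines. This is only a sketch using Python's standard `tarfile` module: `archive_data_dir` is a hypothetical helper, it assumes the server is stopped, and it assumes the WAL directory lives inside the data directory (a symlinked WAL directory is stored as a link, not followed, with these defaults).

```python
import os
import tarfile


def archive_data_dir(pgdata, out_path):
    """Write a gzip-compressed tarball of the whole data directory.

    out_path should live outside pgdata so the archive does not
    try to include itself.
    """
    pgdata = os.path.abspath(pgdata)
    with tarfile.open(out_path, "w:gz") as tar:
        # Store paths relative to the data directory's own name.
        tar.add(pgdata, arcname=os.path.basename(pgdata))
    return out_path


# Hypothetical usage, with the cluster shut down:
#   archive_data_dir(os.environ["PGDATA"], "/tmp/pgdata-copy.tar.gz")
```

Taking such a copy before any repair attempt also preserves the evidence needed for diagnosis.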

#3 Valentin Bogdanov
valiouk@yahoo.co.uk
In reply to: Tom Lane (#2)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Is this database sufficiently small and non-proprietary that you could
send me a filesystem copy of it (a tarball of all of $PGDATA including
the WAL files)?

I solved my problem by resetting the next transaction ID with the pg_resetxlog utility.

Sorry, I cannot send you the database since it is proprietary and is also quite big, but if there is anything else I can do just let me know.

thanks,
val
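
The message above does not say which options or values were used. As a hedged illustration only, the sketch below builds (but does not run) two `pg_resetxlog` command lines: a `-n` dry run that prints the control values without modifying anything, and a `-x` run that sets the next transaction ID. `resetxlog_commands` and the data directory path are hypothetical, and the XID is simply the value from the log in the first message, used as a placeholder.

```python
def resetxlog_commands(datadir, next_xid):
    """Return (dry_run, reset) argv lists for pg_resetxlog.

    Nothing is executed here; the caller decides whether to run them.
    """
    dry_run = ["pg_resetxlog", "-n", datadir]
    reset = ["pg_resetxlog", "-x", str(int(next_xid)), datadir]
    return dry_run, reset


# Placeholder inputs: the path is hypothetical, and the thread does not
# state which transaction ID was actually supplied.
dry_run, reset = resetxlog_commands("/var/lib/postgresql/8.1/main", 2664007622)
print(" ".join(dry_run))
print(" ".join(reset))
```

Resetting the WAL discards unreplayed changes, so this is a last resort, best attempted only on a cluster that has already been copied elsewhere.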

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:

val <valiouk@yahoo.co.uk> writes:

I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Hmm ... I wonder if this is telling us that our patch here was
incomplete?
http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php

At the time we thought this failure could only occur during _bt_pagedel
but you have evidently got a case where a split is failing. It might
just be garden-variety index corruption, or it might be a real bug.

Did you catch that this had occurred during recovery?

Can we downgrade the error from PANIC to LOG please? One corrupt index
shouldn't prevent us from restarting the whole server. Plus, if we have
to use pg_resetxlog to get us out of trouble it isn't going to help much
with diagnosis. We can rebuild indexes once the server is up.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
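
Rebuilding an index once the server is up would typically mean finding its name and issuing REINDEX. The sketch below only builds the SQL text. It assumes that the quoted "100924" in the PANIC message is the index's relfilenode (an assumption, not confirmed in this thread), and `lookup_sql`, `reindex_sql` and `quote_ident` are hypothetical helpers.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def lookup_sql(relfilenode):
    """Catalog query that maps a relfilenode to an index name."""
    return (
        "SELECT n.nspname, c.relname "
        "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace "
        f"WHERE c.relfilenode = {int(relfilenode)} AND c.relkind = 'i';"
    )


def reindex_sql(schema, index):
    """REINDEX statement for one schema-qualified index."""
    return f"REINDEX INDEX {quote_ident(schema)}.{quote_ident(index)};"


print(lookup_sql("100924"))
# The schema and index names here are made up for illustration.
print(reindex_sql("public", "orders_pkey"))
```

The lookup has to be run in the database that owns the index, since pg_class is per-database.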

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#4)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:

Hmm ... I wonder if this is telling us that our patch here was
incomplete?
http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php

At the time we thought this failure could only occur during _bt_pagedel
but you have evidently got a case where a split is failing. It might
just be garden-variety index corruption, or it might be a real bug.

Did you catch that this had occurred during recovery?

Yes, I did. Which is one of the reasons I think there might be a real
bug there, but without any evidence to look at it's hard to do much
about it now. (Also, our solution to the underlying problem is quite
different now than it was in 8.1, so I'm doubtful that the bug still
exists in current code even if it's real in 8.1.)

Can we downgrade the error from PANIC to LOG please?

No, that seems utterly unsafe to me. We'd have a corrupt index and
nothing to cause it to get repaired.

regards, tom lane

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:

Can we downgrade the error from PANIC to LOG please?

No, that seems utterly unsafe to me. We'd have a corrupt index and
nothing to cause it to get repaired.

We do exactly this with GIN and GIST indexes currently.

I'd rather have a database that works, but has a corrupt index than one
that won't come up at all.


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#6)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:

No, that seems utterly unsafe to me. We'd have a corrupt index and
nothing to cause it to get repaired.

We do exactly this with GIN and GIST indexes currently.

Which are not used in any system indexes.

I'd rather have a database that works, but has a corrupt index than one
that won't come up at all.

If the btree in question is a critical system index, your value of
"work" is going to be pretty damn small.

regards, tom lane

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#7)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:

No, that seems utterly unsafe to me. We'd have a corrupt index and
nothing to cause it to get repaired.

We do exactly this with GIN and GIST indexes currently.

Which are not used in any system indexes.

I'd rather have a database that works, but has a corrupt index than one
that won't come up at all.

If the btree in question is a critical system index, your value of
"work" is going to be pretty damn small.

Those are good points.

So if it's a system index we can throw a PANIC, else just LOG. Whilst a
corrupt index is annoying in the extreme, a total server outage is not
something we should allow. IMHO.


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#8)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:

If the btree in question is a critical system index, your value of
"work" is going to be pretty damn small.

So if it's a system index we can throw a PANIC, else just LOG. Whilst a
corrupt index is annoying in the extreme, a total server outage is not
something we should allow. IMHO.

I think an appropriate solution would be to institute some mechanism
that forces a reindex of the corrupted index at completion of recovery.
Merely fooling around with message severity levels doesn't fix anything
at all, it just opens the door to more trouble than you've already got.

Whether this is important enough to get done in the near future is
a different discussion...

regards, tom lane
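
The two positions in this exchange can be put side by side in a toy model. The sketch below is not PostgreSQL code and does not describe any implemented behaviour: every name in it is hypothetical. It combines the severity split suggested earlier (PANIC for system indexes, LOG otherwise) with the forced reindex at the end of recovery proposed above, so that a logged corruption is never left unrepaired.

```python
from dataclasses import dataclass, field


class RecoveryPanic(Exception):
    """Stands in for a PANIC that aborts startup."""


@dataclass
class RecoveryState:
    pending_reindex: list = field(default_factory=list)
    log: list = field(default_factory=list)


def on_index_corruption(state, index_name, is_system_index):
    """Abort for system indexes; otherwise log and remember the index."""
    if is_system_index:
        raise RecoveryPanic(f"corrupt system index {index_name}")
    state.log.append(f"LOG: corrupt index {index_name}, reindex scheduled")
    state.pending_reindex.append(index_name)


def finish_recovery(state, reindex):
    """Rebuild every remembered index before the server accepts work."""
    rebuilt = []
    for name in list(state.pending_reindex):
        reindex(name)  # caller-supplied rebuild action
        rebuilt.append(name)
    state.pending_reindex.clear()
    return rebuilt


state = RecoveryState()
on_index_corruption(state, "orders_pkey", is_system_index=False)
print(finish_recovery(state, reindex=lambda name: None))
```

The point of the queue is that the repair does not depend on anyone reading the log, which is the objection raised against changing the severity level alone.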

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

On Thu, 2009-01-08 at 15:04 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:

If the btree in question is a critical system index, your value of
"work" is going to be pretty damn small.

So if it's a system index we can throw a PANIC, else just LOG. Whilst a
corrupt index is annoying in the extreme, a total server outage is not
something we should allow. IMHO.

I think an appropriate solution would be to institute some mechanism
that forces a reindex of the corrupted index at completion of recovery.
Merely fooling around with message severity levels doesn't fix anything
at all, it just opens the door to more trouble than you've already got.

Well you know I agree on the longer term solution.

But with a down server, you just force people to do pg_resetxlog, which
loses both the corruption (probably) and real, useful data (likely) and
*then* they bring up the server. I don't see why we should force people
to take a manual action and lose data to bring up the server. It's not
like they'll just look at it and say how much of a shame it is it won't
start. They will be bringing up the server, somehow, or they get the
sack. IMHO. I'll say no more though; it's not an argument.


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#10)
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Simon Riggs <simon@2ndQuadrant.com> writes:

But with a down server, you just force people to do pg_resetxlog, which
loses both the corruption (probably) and real, useful data (likely) and
*then* they bring up the server. I don't see why we should force people
to take a manual action and lose data to bring up the server.

That's all fine, but simply reducing the message level from PANIC to LOG
remains an utterly unacceptable "solution". What will happen is that
the server will start, the DBA will go back to sleep after ignoring
(most likely, never even reading) the log message, and the corruption
will get worse. The potential consequences of corruption in a pg_class
index, for example, are just horrid. Frankly I'd rather "rm -rf $PGDATA"
and force someone to go back to their last backup than let them continue
to run with a database that is known to be broken and the system didn't
do anything more to warn them than emit a LOG message someplace.

(No, I'm not seriously proposing that as a recovery technique. But it's
no more irresponsible than ignoring a corruption condition.)

regards, tom lane