Crash while recovering database index relation
Hi,
On one of our test boxen here, we've experienced a corrupted file during
database recovery after box power outage. The specific error message is
PANIC: invalid page header in block 6 of relation "17792"
At this point I fired up a hex dumper to inspect the file, and the last
block in the file (which that error refers to) was clearly garbage.
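
For anyone wanting to repeat that inspection, here is a minimal sketch of pulling one block out of a relation file and hex-dumping it. It assumes the default 8192-byte block size; `read_block` and `hexdump` are hypothetical helper names, not anything shipped with postgres, and the sketch makes no attempt to interpret the page header itself.

```python
BLCKSZ = 8192  # assumed default block size; a custom build may differ


def read_block(path, blkno, blcksz=BLCKSZ):
    """Return the raw bytes of block `blkno` from a relation file."""
    with open(path, "rb") as f:
        f.seek(blkno * blcksz)
        return f.read(blcksz)


def hexdump(data, width=16):
    """Format bytes as: offset, hex columns, printable ASCII."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as "." in the ASCII column
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hex_part:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)
```

With the error above, something like `print(hexdump(read_block(path, 6)))` would show the block the PANIC complains about, where `path` is wherever the file for relation 17792 lives under the data directory.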
This was on postgres 7.4. The system in question is using ReiserFS, and some
journal transactions were replayed on the same boot as the failed postgres
recovery. I believe this is significant (see below).
By using postgres single-user database server and zero_damaged_pages option
I managed to get the database up again. There were a LOT of relations with
this problem!
It may be significant that this is an index (primary key) for a relation.
ALL of the files with problems were either indexes or primary keys!
I do NOT believe this was a hardware error. What I think happened is:
- postgres extended some indexes
- reiserfs journalled the metadata
- new file contents got buffered by the kernel in memory
- XLog stuff gets fsync()'d
- Power cycle
- reiserfs replayed metadata journal, extended the files.
  Probably makes the last blocks in each file invalid!
- postgres attempts to recover from its log, and bumps into the (now
  garbage) blocks
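
The sequence above can be illustrated with a toy model. This is only a sketch of the hypothesis, not of how reiserfs actually works: `ToyFile`, its methods, and the `GARBAGE` filler are all made up for illustration. The single assumption it encodes is that the file length is journalled immediately while the file contents sit in kernel memory until flushed.

```python
BLCKSZ = 8192
GARBAGE = b"\xde\xad" * (BLCKSZ // 2)  # stand-in for stale on-disk bytes


class ToyFile:
    """Toy model: file length is journalled, file contents are not."""

    def __init__(self):
        self.durable_blocks = []  # contents that really reached the disk
        self.journalled_len = 0   # length recorded in the metadata journal
        self.cache = []           # contents buffered in kernel memory only

    def extend(self, page):
        self.cache.append(page)   # new contents are only buffered
        self.journalled_len += 1  # but the new length is journalled at once

    def fsync(self):
        self.durable_blocks.extend(self.cache)  # flush buffered contents
        self.cache = []

    def power_cycle(self):
        self.cache = []           # anything still buffered is lost

    def replay_journal(self):
        # The file regains its journalled length; blocks whose contents
        # never reached the disk come back holding garbage.
        blocks = list(self.durable_blocks)
        while len(blocks) < self.journalled_len:
            blocks.append(GARBAGE)
        return blocks
```

Extending a synced six-block file by one page, then calling `power_cycle()` and `replay_journal()`, yields seven blocks with garbage in block 6, which matches the shape of the error message.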
I'll see if I can get some time to reproduce this reliably.
Guy Thornley
Guy Thornley <guy@esphion.com> writes:
On one of our test boxen here, we've experienced a corrupted file during
database recovery after box power outage. The specific error message is
PANIC: invalid page header in block 6 of relation "17792"
This was on postgres 7.4.
I believe this is fixed in 7.4.1:
2003-12-01 11:53 tgl
* src/backend/storage/buffer/: bufmgr.c (REL7_3_STABLE), bufmgr.c
(REL7_4_STABLE), bufmgr.c: Force zero_damaged_pages to be
effectively ON during recovery from WAL, since there is no need to
worry about damaged pages when we are going to overwrite them
anyway from the WAL. Per recent discussion.
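
The behaviour described in that commit message can be sketched roughly as follows. This is not the bufmgr.c code: `read_page`, `header_ok`, and `InvalidPageHeader` are hypothetical names, and the header check is passed in as a callable because the real validity test is not shown in this thread.

```python
BLCKSZ = 8192


class InvalidPageHeader(Exception):
    """Raised when a damaged page is read and zeroing is not allowed."""


def read_page(raw, header_ok, in_recovery, zero_damaged_pages=False):
    """Return a usable page image, zeroing damaged pages where permitted.

    header_ok is a caller-supplied check (hypothetical); during WAL
    recovery a damaged page is zeroed regardless of the option, since
    replay is going to overwrite it anyway.
    """
    if header_ok(raw):
        return raw
    if in_recovery or zero_damaged_pages:
        return bytes(BLCKSZ)  # all-zero page, to be refilled from the WAL
    raise InvalidPageHeader("invalid page header")
```

Under this sketch a garbage tail block no longer stops recovery, while the same block read during normal operation still raises unless zero_damaged_pages is set.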
By using postgres single-user database server and zero_damaged_pages option
I managed to get the database up again. There were a LOT of relations with
this problem!
And no sign of corruption after you'd run through the recovery with
zero_damaged_pages? That's what I'd expect if this scenario applies:
the pages will be fixed by WAL recovery, it's just that the recently
added check for broken page headers was interfering :-(
regards, tom lane
PANIC: invalid page header in block 6 of relation "17792"
This was on postgres 7.4.
I believe this is fixed in 7.4.1:
...
And no sign of corruption after you'd run through the recovery with
zero_damaged_pages?
I checked them this morning; there isn't.
Sorry for bugging you about something already fixed.
That's what I'd expect if this scenario applies:
the pages will be fixed by WAL recovery, it's just that the recently
added check for broken page headers was interfering :-(
What I don't grok is why all the affected files were indexes, and none
of the heap files appeared to have junk pages.
Guy Thornley
Guy Thornley <guy@esphion.com> writes:
That's what I'd expect if this scenario applies:
the pages will be fixed by WAL recovery, it's just that the recently
added check for broken page headers was interfering :-(
What I don't grok is why all the affected files were indexes, and none
of the heap files appeared to have junk pages
Hmmm ... that is mildly interesting, but it doesn't rise to the level of
warning bells in my head. At least not yet. Were the indexes involved
all on the same table, or different tables? If the former, it could
just be that that was the last set of changes to be flushed out after an
update of that table. If they were on different tables then it's a more
surprising coincidence. Could happen anyway I suppose --- index pages
are likely to be more heavily accessed than heap pages, and thus less
likely to get flushed out of the buffer cache.
regards, tom lane
What I don't grok is why all the affected files were indexes, and none
of the heap files appeared to have junk pages
Hmmm ... that is mildly interesting, but it doesn't rise to the level of
warning bells in my head.
I played around a bit yesterday with an INSERT'ing shell script and a reset
button... I can now, with reasonable confidence, say it was pure coincidence
that they were all index files. I had the junk pages in normal heap files as
well as index files on several occasions while testing.
Were the indexes involved all on the same table, or different tables?
Different tables, which is what aroused my own curiosity :)
Guy