Invalid headers and xlog flush failures

Started by Bricklen Anderson · about 21 years ago · 10 messages · pgsql-general
#1Bricklen Anderson
banderson@presinet.com

Hi all,

I recently came across some apparent corruption in one of our databases around a month ago.
version: PostgreSQL 8.0 (originally 8r3, now at 8.0.1), on a Debian box

The messages that we were originally getting in our syslog were about invalid page headers. After
googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected
table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load:

Feb  1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT:  writing block 47272
of relation 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING:  could not write
block 47272 of 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR:  xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

This maps to an index. I reindexed it (and several other tables), and 3 hours later restarted my
load process. Shortly after that, the same thing happened again (with different numbers this time):

Feb  1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT:  writing block 7502
of relation 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING:  could not write
block 7502 of 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR:  xlog flush request
973/3EF36C2C is not satisfied --- flushed only to 73/419878B4

Both sets are repeated continuously through the syslog.

I pursued some references to XID wraparound, but that didn't seem likely from what I could see
(unless I'm misreading the numbers)
SELECT datname, age(datfrozenxid) FROM pg_database where datname='dev17';
datname | age
---------+------------
dev17 | 1074008776
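
For reference, XID wraparound only becomes a danger as age() approaches 2^31, and the value above is only about half way there, which supports ruling it out. A quick arithmetic sketch (illustrative only, using the reported number):

```python
# XID space is 32 bits; wraparound trouble starts around 2^31 transactions.
WRAP_HORIZON = 2 ** 31

age = 1074008776  # age(datfrozenxid) reported for dev17
fraction = age / WRAP_HORIZON
print(round(fraction, 2))  # roughly half the horizon, so wraparound is unlikely here
```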

Here is a pg_filedump of 151565:
$ pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 3.0
*
* File: /var/postgres/data/base/17235/151565
* Options used: -i -f -R 7502
*
* Dump created on: Tue Feb 1 14:34:14 2005
*******************************************************************

Block 7502 ********************************************************
<Header> -----
Block Offset: 0x03a9c000 Offsets: Lower 988 (0x03dc)
Block: Size 8192 Version 2 Upper 3336 (0x0d08)
LSN: logid 115 recoff 0x39e855f4 Special 8176 (0x1ff0)
Items: 242 Free Space: 2348
Length (including item array): 992

Error: Invalid header information.

0000: 73000000 f455e839 01000000 dc03080d s....U.9........
0010: f01f0220 cc912800 e0912800 f4912800 ... ..(...(...(.
0020: 08922800 1c922800 30922800 44922800 ..(...(.0.(.D.(.
<snipped>

<Data> ------
Item 1 -- Length: 20 Offset: 4556 (0x11cc) Flags: USED
Block Id: 9016 linp Index: 2 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11cc: 00003823 02001440 0b000000 022000cf ..8#...@..... ..
11dc: 66f06500 f.e.

Item 2 -- Length: 20 Offset: 4576 (0x11e0) Flags: USED
Block Id: 9571 linp Index: 8 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11e0: 00006325 08001440 0b000000 022000cf ..c%...@..... ..
11f0: 66f06400 f.d.

Item 3 -- Length: 20 Offset: 4596 (0x11f4) Flags: USED
Block Id: 9571 linp Index: 3 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11f4: 00006325 03001440 0b000000 022000cf ..c%...@..... ..
1204: 66f06400 f.d.
<snipped>

<Special Section> -----
BTree Index Section:
Flags: 0x0001 (LEAF)
Blocks: Previous (1314) Next (1958) Level (0)

1ff0: 22050000 a6070000 00000000 01000000 "...............

*** End of Requested Range Encountered. Last Block Read: 7502 ***

Can anyone suggest what I should try next, or if you need more information, I'll happily supply what
I can.

Inline are the changes I made to pg_filedump to get it to compile and work, as such I can't attest
to pg_filedump's accuracy (which could be the source of those invalid header messages)
#########################################
--- pg_filedump.c.old       2004-02-23 12:58:58.000000000 -0800
+++ ../pg_filedump-3.0/pg_filedump.c.new 2005-01-31 09:24:36.000000000
-0800
@@ -742,8 +742,8 @@
           printf ("  XID: min (%u)  CMIN|XMAX: %u  CMAX|XVAC: %u\n"
                   "  Block Id: %u  linp Index: %u   Attributes: %d Size: %d\n",
-                 htup->t_xmin, htup->t_field2.t_cmin,
-                 htup->t_field3.t_cmax,
+                 htup->t_choice.t_heap.t_xmin,htup->t_choice.t_heap.t_cmin,
+                 htup->t_choice.t_heap.t_field4.t_cmax,
                   ((uint32)
                    ((htup->t_ctid.ip_blkid.bi_hi << 16) | (uint16) htup->t_ctid.ip_blkid.bi_lo)),
htup->t_ctid.ip_posid,
@@ -768,8 +768,8 @@
             strcat (flagString, "HASCOMPRESSED|");
           if (infoMask & HEAP_HASOID)
             strcat (flagString, "HASOID|");
-         if (infoMask & HEAP_XMAX_IS_XMIN)
-           strcat (flagString, "XMAX_IS_XMIN|");
+         //if (infoMask & HEAP_XMAX_IS_XMIN)
+         //  strcat (flagString, "XMAX_IS_XMIN|");
           if (infoMask & HEAP_XMAX_UNLOGGED)
             strcat (flagString, "XMAX_UNLOGGED|");
           if (infoMask & HEAP_XMIN_COMMITTED)
@@ -1042,7 +1042,7 @@
               "   Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n"
               "  Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n"
               "                |-       Undo: Log File (%u) Offset (0x%08x)\n"
-             "                |- StartUp ID: %u\n"
+             //"                |- StartUp ID: %u\n"
               "                |-   Next XID: %u\n"
               "                |-   Next OID: %u\n"
               "                 -       Time: %s"
@@ -1064,7 +1064,7 @@
               controlData->prevCheckPoint.xlogid,
               controlData->prevCheckPoint.xrecoff,checkPoint->redo.xlogid,
               checkPoint->redo.xrecoff, checkPoint->undo.xlogid,
-             checkPoint->undo.xrecoff, checkPoint->ThisStartUpID,
+             checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/
               checkPoint->nextXid, checkPoint->nextOid,
               ctime (&checkPoint->time), controlData->blcksz,
               controlData->relseg_size, controlData->nameDataLen,
#########################################

Obviously, the bottom 3 chunks aren't very useful, but a cursory examination of the relevant headers
didn't make it obvious what these should be changed to.

Sorry for the long-winded email, and thanks for taking the time to read this far!

Cheers,

Bricklen

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bricklen Anderson (#1)
Re: Invalid headers and xlog flush failures

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

Hmm, have you perhaps played any games with pg_resetxlog in this database?

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time. And now it's evidently
ending at 73. Not good. What file names do you see in pg_xlog/, and
what does pg_controldata show?

regards, tom lane

#3Bricklen Anderson
banderson@presinet.com
In reply to: Tom Lane (#2)
Re: Invalid headers and xlog flush failures

Tom Lane wrote:

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

Hmm, have you perhaps played any games with pg_resetxlog in this database?

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time. And now it's evidently
ending at 73. Not good. What file names do you see in pg_xlog/, and
what does pg_controldata show?

regards, tom lane

Hi Tom,

Nope, never touched pg_resetxlog.
My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
There are also these: 000000010000007400000000 to 00000001000000740000000B

$ pg_controldata
pg_control version number: 74
Catalog version number: 200411041
Database system identifier: 4738750823096876774
Database cluster state: in production
pg_control last modified: Wed 02 Feb 2005 12:38:22 AM PST
Current log file ID: 115
Next log file segment: 66
Latest checkpoint location: 73/419A4BDC
Prior checkpoint location: 73/419A4B80
Latest checkpoint's REDO location: 73/419A4BDC
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 4161807
Latest checkpoint's NextOID: 176864
Time of latest checkpoint: Wed 02 Feb 2005 12:38:22 AM PST
Database block size: 8192
Blocks per segment of large relation: 131072
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: en_CA
LC_CTYPE: en_CA

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bricklen Anderson (#3)
Re: Invalid headers and xlog flush failures

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Tom Lane wrote:

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time.

Nope, never touched pg_resetxlog.
My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
There are also these: 000000010000007400000000 to 00000001000000740000000B

That seems like rather a lot of files; do you have checkpoint_segments
set to a large value, like 100? The pg_controldata dump shows that the
latest checkpoint record is in the 73/41 file, so presumably the active
end of WAL isn't exceedingly far past that. You've got 200 segments
prepared for future activity, which is a bit over the top IMHO.

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.
IIRC these numbers are all hex, so the extra "9" could come from just
two bits getting turned on that should not be. Might be time to run
memtest86 and/or badblocks.

regards, tom lane
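
Tom's two-bit observation is easy to verify: 0x973 and 0x73 differ only in the extra nibble 0x9 = 0b1001, i.e. bits 8 and 11 of the log-file ID. A quick Python sketch (illustrative only):

```python
# The bogus log-file ID seen in the page LSNs vs. the real end of WAL:
good = 0x73
bad = 0x973

# XOR isolates exactly the bits that differ between the two values.
diff = good ^ bad
flipped = [b for b in range(diff.bit_length()) if diff >> b & 1]

print(hex(diff))   # 0x900
print(flipped)     # [8, 11] -- exactly two bits turned on
```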

#5Bricklen Anderson
banderson@presinet.com
In reply to: Tom Lane (#4)
Re: Invalid headers and xlog flush failures

Tom Lane wrote:

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Tom Lane wrote:

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time.

Nope, never touched pg_resetxlog.
My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
There are also these: 000000010000007400000000 to 00000001000000740000000B

That seems like rather a lot of files; do you have checkpoint_segments
set to a large value, like 100? The pg_controldata dump shows that the
latest checkpoint record is in the 73/41 file, so presumably the active
end of WAL isn't exceedingly far past that. You've got 200 segments
prepared for future activity, which is a bit over the top IMHO.

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.
IIRC these numbers are all hex, so the extra "9" could come from just
two bits getting turned on that should not be. Might be time to run
memtest86 and/or badblocks.

regards, tom lane

Yes, checkpoint_segments is set to 100, although I can set that lower if you feel that that is more
appropriate. Currently, the system receives around 5-8 million inserts per day (across 3 primary
tables), so I was leaning towards the "more is better" philosophy.

We ran e2fsck with badblocks option last week and didn't turn anything up, along with a couple of
passes with memtest. I will run a full-scale memtest and post any interesting results.

I've also read that kill -9 on the postmaster is "not a good thing". I honestly can't vouch for
whether or not this may have occurred around the time of the initial creation of this database. It's
possible, since this db started its life as a development db at 8r3, then was bumped to 8r5, then on
to 8 final, where it has become a dev-final db.

Assuming that the memtest passes cleanly, as does another run of badblocks, do you have any more
suggestions on how I should proceed? Should I run for a while with zero_damaged_pages set to true
and accept the data loss, or just recreate the whole db from scratch?

Thanks again for your help.

Cheers,

Bricklen

#6Bricklen Anderson
banderson@presinet.com
In reply to: Bricklen Anderson (#5)
Re: Invalid headers and xlog flush failures

Bricklen Anderson wrote:

Tom Lane wrote:

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Tom Lane wrote:

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time.

Nope, never touched pg_resetxlog.
My pg_xlog list ranges from 000000010000007300000041 to
0000000100000073000000FE, with no breaks. There are also these:
000000010000007400000000 to 00000001000000740000000B

That seems like rather a lot of files; do you have checkpoint_segments
set to a large value, like 100? The pg_controldata dump shows that the
latest checkpoint record is in the 73/41 file, so presumably the active
end of WAL isn't exceedingly far past that. You've got 200 segments
prepared for future activity, which is a bit over the top IMHO.

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.
IIRC these numbers are all hex, so the extra "9" could come from just
two bits getting turned on that should not be. Might be time to run
memtest86 and/or badblocks.

regards, tom lane

Yes, checkpoint_segments is set to 100, although I can set that lower if
you feel that that is more appropriate. Currently, the system receives
around 5-8 million inserts per day (across 3 primary tables), so I was
leaning towards the "more is better" philosophy.

We ran e2fsck with badblocks option last week and didn't turn anything
up, along with a couple of passes with memtest. I will run a full-scale
memtest and post any interesting results.

I've also read that kill -9 on the postmaster is "not a good thing". I honestly
can't vouch for whether or not this may have occurred around
the time of the initial creation of this database. It's possible, since
this db started its life as a development db at 8r3, then was bumped to
8r5, then on to 8 final, where it has become a dev-final db.

Assuming that the memtest passes cleanly, as does another run of
badblocks, do you have any more suggestions on how I should proceed?
Should I run for a while with zero_damaged_pages set to true and accept
the data loss, or just recreate the whole db from scratch?

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Any ideas on what I should try next? Considering that this db is not in production yet, I _do_ have
the liberty to rebuild the database if necessary. Do you have any further recommendations?

thanks again,

Bricklen

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bricklen Anderson (#6)
Re: Invalid headers and xlog flush failures

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Tom Lane wrote:

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Hmm ... that's not proof your hardware is ok, but it at least puts the
ball back in play.

Any ideas on what I should try next? Considering that this db is not
in production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

If the database isn't too large, I'd suggest saving aside a physical
copy (eg, cp or tar dump taken with postmaster stopped) for forensic
purposes, and then rebuilding so you can get on with your own work.

One bit of investigation that might be worth doing is to look at every
single 8K page in the database files and collect information about the
LSN fields, which are the first 8 bytes of each page. In a non-broken
database all of these should be less than or equal to the current ending
WAL offset (which you can get with pg_controldata if the postmaster is
stopped). We know there are at least two bad pages, but are there more?
Is there any pattern to the bad LSN values? Also it would be useful to
look at each bad page in some detail to see if there's any evidence of
corruption extending beyond the LSN value.

regards, tom lane
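
The scan Tom describes can be sketched in a few lines of Python: read each relation file in 8K pages and unpack the first 8 bytes as the page LSN. This sketch assumes 8.0's page layout (pd_lsn as two 32-bit fields, xlogid then xrecoff, at offset 0), little-endian byte order as on x86, and a hypothetical end-of-WAL value taken from pg_controldata; none of this is tested against real data files.

```python
import struct

BLCKSZ = 8192
END_OF_WAL = (0x74, 0x0B000000)  # (logid, recoff) -- hypothetical, read yours from pg_controldata

def scan_lsns(path):
    """Yield (block number, logid, recoff) for every full page in a relation file."""
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break
            # First 8 bytes of the page header: pd_lsn, assumed little-endian here.
            logid, recoff = struct.unpack("<II", page[:8])
            yield blkno, logid, recoff
            blkno += 1

def bad_pages(path, end=END_OF_WAL):
    """Pages whose LSN lies past the current end of WAL -- these cannot be valid."""
    return [(b, hex(l), hex(r)) for b, l, r in scan_lsns(path) if (l, r) > end]
```

Running bad_pages() over every file under base/ would answer Tom's questions directly: how many pages are bad, and whether the bogus LSNs share a pattern.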

#8Alban Hertroys
alban@magproductions.nl
In reply to: Bricklen Anderson (#6)
Re: Invalid headers and xlog flush failures

Bricklen Anderson wrote:

Any ideas on what I should try next? Considering that this db is not in
production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

I recall reading something in this ML about problems with the way that
Ext3 FS recovers a dirty file system, could it be related?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

#9Bricklen Anderson
banderson@presinet.com
In reply to: Alban Hertroys (#8)
Re: Invalid headers and xlog flush failures

Alban Hertroys wrote:

Bricklen Anderson wrote:

Any ideas on what I should try next? Considering that this db is not
in production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

I recall reading something in this ML about problems with the way that
Ext3 FS recovers a dirty file system, could it be related?

I really have no idea, but we _are_ running an ext3 fs on this particular server.


#10Bricklen Anderson
banderson@presinet.com
In reply to: Tom Lane (#7)
Re: Invalid headers and xlog flush failures

Tom Lane wrote:

Bricklen Anderson <BAnderson@PresiNET.com> writes:

Tom Lane wrote:

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Hmm ... that's not proof your hardware is ok, but it at least puts the
ball back in play.

Any ideas on what I should try next? Considering that this db is not
in production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

If the database isn't too large, I'd suggest saving aside a physical
copy (eg, cp or tar dump taken with postmaster stopped) for forensic
purposes, and then rebuilding so you can get on with your own work.

One bit of investigation that might be worth doing is to look at every
single 8K page in the database files and collect information about the
LSN fields, which are the first 8 bytes of each page.

Do you mean this line from pg_filedump's results:

LSN: logid 56 recoff 0x3f4be440 Special 8176 (0x1ff0)

If so, I've set up a shell script that looped all of the files and emitted that line.
It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line.
I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I
believe is the decimal of hex 74?).

In case anyone wants the script that I ran to get the LSN:
#!/bin/sh

# Walk every relation file in the database directory, appending each block's
# LSN line to LSN_out until pg_filedump reads past the last block (at which
# point grep finds no LSN line and exits non-zero).
for FILE in /var/postgres/data/base/17235/*; do
i=0
echo "$FILE" >> test_file
while true; do
str=`pg_filedump -R $i "$FILE" | grep LSN`
if [ $? -ne 0 ]; then
break
fi
echo "$FILE: $str" >> LSN_out
i=$((i+1))
done
done
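
The Perl step could be sketched in Python instead. The line format below is taken from the pg_filedump output shown earlier in the thread (decimal logid, hex recoff), and the cutoff 116 (hex 74) is the current end-of-WAL log file ID; a sketch, not tested against real output:

```python
import re

# Matches lines of the form written by the shell script above, e.g.
#   /var/postgres/data/base/17235/151565: LSN:  logid 115 recoff 0x39e855f4 Special 8176 (0x1ff0)
LSN_RE = re.compile(
    r"^(?P<file>[^:]+):\s+LSN:\s+logid\s+(?P<logid>\d+)\s+recoff\s+0x(?P<recoff>[0-9a-fA-F]+)"
)

def suspect_lsns(lines, max_logid=0x74):
    """Return (file, logid, recoff) for every LSN whose logid is past the end of WAL."""
    out = []
    for line in lines:
        m = LSN_RE.match(line)
        if m and int(m.group("logid")) > max_logid:
            out.append((m.group("file"), int(m.group("logid")), int(m.group("recoff"), 16)))
    return out
```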

In a non-broken database all of these should be less than or equal to the current ending
WAL offset (which you can get with pg_controldata if the postmaster is
stopped). We know there are at least two bad pages, but are there more?
Is there any pattern to the bad LSN values? Also it would be useful to
look at each bad page in some detail to see if there's any evidence of
corruption extending beyond the LSN value.

regards, tom lane

NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig
into it further.

Thanks again for your help, Tom.

Cheers,

Bricklen