corrupt pages detected by enabling checksums

Started by Jeff Janes · about 13 years ago · 87 messages · pgsql-hackers
#1 Jeff Janes
jeff.janes@gmail.com

I've changed the subject from "regression test failed when enabling
checksum" because I now know they are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a
simple tweak I rendered the two issues orthogonal.

On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

What I would probably really want is the data as it existed after the
crash but before recovery started, but since the postmaster
immediately starts recovery after the crash, I don't know of a good
way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.

You don't know that the cluster is in the bad state until after it goes
through recovery because most crashes recover perfectly fine. So it would
have to make a side-copy of the cluster after the crash, then recover the
original and see how things go, then either retain or delete the side-copy.
Unfortunately my testing harness can't do this at the moment, because the
perl script storing the consistency info needs to survive over the crash
and recovery. It might take me awhile to figure out how to make it do
this.

I have the server log just go to stderr, where it gets mingled together
with messages from my testing harness. I had not uploaded that file.

Here is a new upload. It contains both a data_dir tarball including xlog,
and the mingled stderr ("do_new.out")

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing

The other 3 files in it constitute the harness. It is quite a mess, with
some hard-coded paths. The just-posted fix for wal_keep_segments will also
have to be applied.

The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:

Yes, that part is by my design. Why it didn't get fixed from a FPI is not
by my design, of course.

So, the page may be corrupt without checksums as well, but it just
happens to be hidden for the same reason. Can you try to reproduce it
without -k?

No, things run (seemingly) fine without -k.

And on the checkin right before checksums were added?
Without checksums, you'll need to use pg_filedump (or similar) to find
whether an error has happened.

I'll work on it, but it will take awhile to figure out exactly how to use
pg_filedump to do that, and how to automate that process and incorporate it
into the harness.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the
problem occurs there, so if the problem is not the checksums themselves
(and I agree it probably isn't) it must have been introduced before that
rather than after.

Cheers,

Jeff

#2 Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#1)
Re: corrupt pages detected by enabling checksums

On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:

Yes, that part is by my design. Why it didn't get fixed from a FPI is not
by my design, of course.

There are no FPIs (if you mean a full page image with that) afaics:

Your logs tell us about recovery:
27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING: page verification failed, calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR: invalid page in block 90 of relation base/16384/835589

looking at xlog from that time, the first records are:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG len (rec/tot): 4/ 36, tx: 0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
rmgr: Heap len (rec/tot): 37/ 7905, tx: 26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, so no full pages need
to be logged. And we theoretically should be able to recover things
like torn pages from the WAL, since that should cover everything that
happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in, unless I
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 01:52:41 +0200, Andres Freund wrote:

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in, unless I
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that come after pages without an LSN, and
that have an LSN that's *beyond* the point XLOG has successfully been
written to disk (1/31169A38)?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Janes (#1)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-04-03 at 15:57 -0700, Jeff Janes wrote:

You don't know that the cluster is in the bad state until after it
goes through recovery because most crashes recover perfectly fine. So
it would have to make a side-copy of the cluster after the crash, then
recover the original and see how things go, then either retain or
delete the side-copy. Unfortunately my testing harness can't do this
at the moment, because the perl script storing the consistency info
needs to survive over the crash and recovery. It might take
me awhile to figure out how to make it do this.

Hmm... you're right, that is difficult to integrate into a test.

Another approach is to expand PageHeaderIsValid in a pre-checksums
version to do some extra checks (perhaps the same checks as pg_filedump
does). It might be able to at least detect simple cases, like all zeros
between pd_upper and pd_special (when pd_upper < pd_special). Still not
easy, but maybe more practical than saving the directory like that. We
may even want a GUC for that to aid future debugging (obviously it would
have a performance impact).

The other 3 files in it constitute the harness. It is quite a mess,
with some hard-coded paths. The just-posted fix for wal_keep_segments
will also have to be applied.

I'm still trying to reproduce the problem myself. I keep trying simpler
versions of your test and I don't see the problem. It's a little awkward
to try to run the full version of your test (and I'm jumping between
code inspection and various other ideas).

I'll work on it, but it will take awhile to figure out exactly how to
use pg_filedump to do that, and how to automate that process and
incorporate it into the harness.

It might be more productive to try to expand PageHeaderIsValid as I
suggested above.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and
the problem occurs there, so if the problem is not the checksums
themselves (and I agree it probably isn't) it must have been
introduced before that rather than after.

Thank you for all of your work on this. I have also attached another
test patch that disables most of the complex areas of the checksums
patch (VM bits, hint bits, etc.). If you apply different portions of the
patch (or the whole thing), it will help eliminate possibilities. In
particular, eliminating the calls to PageSetAllVisible+visibilitymap_set
could tell us a lot.

Also, the first data directory you posted had a problem with a page that
appeared to be a new page (otherwise I would have expected either old or
new data at the end). Do you think this might be a problem only for new
pages? Or do you think your test just makes it likely that many writes
will happen on new pages?

I did find one potential problem in the checksums code (aside from my
previous patch involving wal_level=archive): visibilitymap_set is called
sometimes before the heap page is marked dirty. I will examine a little
more closely, but I expect that to require another patch. Not sure if
that could explain this problem or not.

Regards,
Jeff Davis

Attachments:

checksums-check.patch (text/x-patch, +12/-10)
#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 02:28:32 +0200, Andres Freund wrote:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that come after pages without an LSN, and
that have an LSN that's *beyond* the point XLOG has successfully been
written to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too
early. Above the end is detected at 1/31169A38, but the next WAL file
contains valid data:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000032 -n 10
rmgr: Heap2 len (rec/tot): 26/ 58, tx: 0, lsn: 1/32000030, prev 1/31FFFFD8, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 122 remxid 26328926
rmgr: Heap len (rec/tot): 51/ 83, tx: 26328964, lsn: 1/32000070, prev 1/32000030, bkp: 0000, desc: update: rel 1663/16384/835589; tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
rmgr: Heap2 len (rec/tot): 30/ 62, tx: 0, lsn: 1/320000C8, prev 1/32000070, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 63 remxid 26328803
rmgr: Btree len (rec/tot): 34/ 66, tx: 26328964, lsn: 1/32000108, prev 1/320000C8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/14

That would explain exactly those symptoms, wouldn't it? What's causing
that, I am not sure yet.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: corrupt pages detected by enabling checksums

andres@anarazel.de (Andres Freund) writes:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that come after pages without an LSN, and
that have an LSN that's *beyond* the point XLOG has successfully been
written to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

regards, tom lane


#7 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: corrupt pages detected by enabling checksums

On 2013-04-03 20:45:51 -0400, Tom Lane wrote:

andres@anarazel.de (Andres Freund) writes:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that come after pages without an LSN, and
that have an LSN that's *beyond* the point XLOG has successfully been
written to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

Nope, that's from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Unfortunately we can't easily diagnose what happened at:
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
since the startup process wrote its end of recovery checkpoint there:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31169A68, prev 1/31169A38, bkp: 0000, desc: checkpoint: redo 1/31169A68; tli 1; prev tli 1; fpw true; xid 0/26254999; oid 843781; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown

Starting from a some blocks in that wal segments later:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000031 -s 1/3116c000 -n 10
first record is after 1/3116C000, at 1/3116D9D8, skipping over 6616 bytes
rmgr: Heap len (rec/tot): 51/ 83, tx: 26254999, lsn: 1/3116D9D8, prev 1/3116BA20, bkp: 0000, desc: update: rel 1663/16384/835589; tid 38/148 xmax 26254999 ; new tid 44/57 xmax 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 26254999, lsn: 1/3116DA30, prev 1/3116D9D8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/319
rmgr: Heap len (rec/tot): 51/ 83, tx: 26255000, lsn: 1/3116DA78, prev 1/3116DA30, bkp: 0000, desc: update: rel 1663/16384/835589; tid 19/214 xmax 26255000 ; new tid 44/58 xmax 0

the records continue again.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#7)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 02:58:43 +0200, Andres Freund wrote:

Nope, that's from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Ok, I think I see the bug. And I think it's been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of WAL
was erroneously detected, resulting in the following error:
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68

Note that without a *hardware* crash, end of WAL is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full-page written changed while it was
being inserted into the WAL, which caused the error. A possible scenario
for that would be e.g. two concurrent calls to MarkBufferDirtyHint().
Note:
* 2. The caller might have only share-lock instead of exclusive-lock on the
* buffer's content lock.

    if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
        {
            MyPgXact->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer);    /* PART 1 */
        }
    }

    LockBufHdr(bufHdr);
    Assert(bufHdr->refcount > 0);

    if (!(bufHdr->flags & BM_DIRTY))
    {
        dirtied = true;    /* Means "will be dirtied by this action" */

        /*
         * Set the page LSN if we wrote a backup block. We aren't
         * supposed to set this when only holding a share lock but
         * as long as we serialise it somehow we're OK. We choose to
         * set LSN while holding the buffer header lock, which causes
         * any reader of an LSN who holds only a share lock to also
         * obtain a buffer header lock before using PageGetLSN().
         * Fortunately, thats not too many places.
         *
         * If checksums are enabled, you might think we should reset the
         * checksum here. That will happen when the page is written
         * sometime later in this checkpoint cycle.
         */
        if (!XLogRecPtrIsInvalid(lsn))
            PageSetLSN(page, lsn);    /* PART 2 */
    }

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
    rdata[0].data = (char *) (&watermark);
    rdata[0].len = sizeof(int);
    rdata[0].next = &(rdata[1]);

    rdata[1].data = NULL;
    rdata[1].len = 0;
    rdata[1].buffer = buffer;
    rdata[1].buffer_std = true;
    rdata[1].next = NULL;

    return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that one backend grabs WALInsertLock first,
computes a full page write for the buffer, including a CRC, and
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
All the while, the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock-protected parts of XLogInsert().

Only that it already has done that part:
    /*
     * Calculate CRC of the data, including all the backup blocks
     *
     * Note that the record header isn't added into the CRC initially since we
     * don't know the prev-link yet. Thus, the CRC will represent the CRC of
     * the whole record in the order: rdata, then backup blocks, then record
     * header.
     */
    INIT_CRC32(rdata_crc);
    for (rdt = rdata; rdt != NULL; rdt = rdt->next)
        COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:
    while (write_len)
    {
        while (rdata->data == NULL)
            rdata = rdata->next;

        if (freespace > 0)
        {
            if (rdata->len > freespace)
            {
                memcpy(Insert->currpos, rdata->data, freespace);
                rdata->data += freespace;
                rdata->len -= freespace;
                write_len -= freespace;
            }
            else
            {
                memcpy(Insert->currpos, rdata->data, rdata->len);
                freespace -= rdata->len;
                write_len -= rdata->len;
                Insert->currpos += rdata->len;
                rdata = rdata->next;
                continue;
            }
        }

        /* Use next buffer */
        updrqst = AdvanceXLInsertBuffer(false);
        curridx = Insert->curridx;
        /* Mark page header to indicate this record continues on the page */
        Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
        Insert->currpage->xlp_rem_len = write_len;
        freespace = INSERT_FREESPACE(Insert);
    }

If by now the first backend has proceeded to PageSetLSN(), we are writing
different data to disk than what we computed the checksum over
before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#8)
Re: corrupt pages detected by enabling checksums

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

Ok, I think I see the bug. And I think it's been introduced in the
checkpoints patch.

Well spotted. (I think you mean checksums patch).

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same buffer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

copy_before_XLOG_HINT.v1.patch (application/octet-stream, +22/-1)
#10 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#9)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

Ok, I think I see the bug. And I think it's been introduced in the
checkpoints patch.

Well spotted. (I think you mean checksums patch).

Heh, yes. I was slightly tired at that point ;)

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.

Well, there might be other bugs we can't see yet... But lets hope not.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same buffer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.

Unfortunately I find that approach unacceptably ugly. I don't think it's
ok to make an already *very* complicated function (XLogInsert()) in one
of the most complicated files (xlog.c) even more complicated. It
literally took me years to understand a large percentage of it.
I even think the XLOG_HINT escape hatch should be taken out and be
replaced by something outside of XLogInsert().

I don't think that approach would work with as few lines anyway, since you
need to properly pass it into XLogCheckBuffer() et al., which checks the
buffer tags and such - which it needs to properly compute the struct
BkpBlock. Also we iterate over rdata->page for the CRC computation.

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
XLogRecData rdata[2];
BkpBlock bkp;

/*
* make sure no checkpoint can happen while we decide about logging
* something or not, since we don't hold any lock preventing that
* otherwise.
*/
Assert(MyPgXact->delayChkpt);

/* update RedoRecPtr */
GetRedoRecPtr();

/* setup phony rdata element */
rdata[0].buffer = buffer;
rdata[0].buffer_std = true; /* is that actually ok? */

/* force full pages on, otherwise checksums won't work? */
if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
{
char copied_buffer[BLCKSZ];

/*
* copy buffer so we don't have to worry about
* concurrent hint bit or lsn updates. We assume pd_lower/upper
* cannot be changed without an exclusive lock, so the contents
* of bkp are not racy.
*/
memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);

rdata[0].data = (char *) &bkp;
rdata[0].len = sizeof(BkpBlock);
rdata[0].buffer = InvalidBuffer;
rdata[0].buffer_std = false;
rdata[0].next = &(rdata[1]);

rdata[1].data = copied_buffer;
rdata[1].len = BLCKSZ;
rdata[1].buffer = InvalidBuffer;
rdata[1].buffer_std = false;
rdata[1].next = NULL;

recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

return recptr;
}

That should work?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#10)
Re: corrupt pages detected by enabling checksums

On 4 April 2013 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

Unfortunately I find that approach unacceptably ugly.

Yeah. If we can confirm it's a fix we can discuss a cleaner patch and
that is much better.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#10)
Re: corrupt pages detected by enabling checksums

Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.
The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.

Regards,
Jeff Davis


#13Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#12)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 12:59:36 -0700, Jeff Davis wrote:

Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.

I don't think it's really slower. Earlier the code took WalInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be made
in the case we need to log. So I think we end up ahead of the current
state.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.

That sounds like it would be prone to deadlocks, so I'd rather not
go there.

Will write up a patch tomorrow.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#9)
Re: corrupt pages detected by enabling checksums

On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.

This brings up a pretty frightening possibility to me, unrelated to data
checksums. If a bit gets twiddled in the WAL file due to a hardware issue
or a "cosmic ray", and then a crash happens, automatic recovery will stop
early with the failed WAL checksum with an innocuous looking message. The
system will start up but will be invisibly inconsistent, and will proceed
to overwrite that portion of the WAL file which contains the old data (real
data that would have been necessary to reconstruct, once the corruption is
finally realized) with an end-of-recovery checkpoint record and continue
to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.
Changing timelines upon ordinary crash recovery, not just media recovery,
seems excessive but also seems to be exactly what timelines were invented
for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still
think this would cause incorrect results, but...". I didn't realize the
significance of that until now. It does produce incorrect query results.
I was just bailing out before detecting them. Once I specify
ignore_checksum_failure=1
my test harness complains bitterly about the data not being consistent with
what the Perl program knows it is supposed to be.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure lets solve the
problem first then discuss which approach is best.

I've tested your patch and it seems to do the job. It has survived far
longer than unpatched ever did, with neither checksum warnings, nor
complaints of inconsistent data. (I haven't analyzed the code much, just
the results, and leave the discussion of the best approach to others)

Thanks,

Jeff

#15Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Janes (#14)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

This brings up a pretty frightening possibility to me, unrelated to
data checksums. If a bit gets twiddled in the WAL file due to a
hardware issue or a "cosmic ray", and then a crash happens, automatic
recovery will stop early with the failed WAL checksum with
an innocuous looking message. The system will start up but will be
invisibly inconsistent, and will proceed to overwrite that portion of
the WAL file which contains the old data (real data that would have
been necessary to reconstruct, once the corruption is finally realized)
with an end-of-recovery checkpoint record and continue to chew up
real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Regards,
Jeff Davis


#16Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#13)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:

I don't think it's really slower. Earlier the code took WalInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be made
in the case we need to log. So I think we end up ahead of the current
state.

Good point.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

Regards,
Jeff Davis


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#15)
Re: corrupt pages detected by enabling checksums

Jeff Davis <pgsql@j-davis.com> writes:

On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

This brings up a pretty frightening possibility to me, unrelated to
data checksums. If a bit gets twiddled in the WAL file due to a
hardware issue or a "cosmic ray", and then a crash happens, automatic
recovery will stop early with the failed WAL checksum with
an innocuous looking message. The system will start up but will be
invisibly inconsistent, and will proceed to overwrite that portion of
the WAL file which contains the old data (real data that would have
been necessary to reconstruct, once the corruption is finally realized
) with an end-of-recovery checkpoint record and continue to chew up
real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

At least on traditional rotating media, this is only going to perform
well if you dedicate two drives to the purpose. At which point you
might as well just say "let's write two copies of WAL". Or maybe three
copies, so that you can take a vote when they disagree. While this is
not so unreasonable on its face for ultra-high-reliability requirements,
I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the
redundant copies to be on other machines. So the questions that that
leads to are how smart is our SR code about dealing with a master that
tries to re-send WAL regions it already sent, and what a slave ought to
do in such a situation if the new data doesn't match.

regards, tom lane


#18Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#17)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote:

I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

The goal would be to reliably detect a situation where WAL that has been
flushed successfully was later corrupted; not necessarily to recover
when we find it. Unless it's something like ZFS, I don't think most
software RAID will reliably detect corruption.

A side benefit is that it would also help catch bugs like this in the
future.

I'm not advocating for this particular solution, but I do concur with
Jeff Janes that it's a little bit scary and that we can entertain some
ideas about how to mitigate it.

Regards,
Jeff Davis


#19Florian Pflug
fgp@phlo.org
In reply to: Jeff Janes (#14)
Re: corrupt pages detected by enabling checksums

On Apr4, 2013, at 23:21 , Jeff Janes <jeff.janes@gmail.com> wrote:

This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early with the failed WAL checksum with an innocuous looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized) with an end-of-recovery checkpoint record and continue to chew up real data from there.

Maybe we could scan forward to check whether a corrupted WAL record is followed by one or more valid ones with sensible LSNs. If it is, chances are high that we haven't actually hit the end of the WAL. In that case, we could either log a warning, or (better, probably) abort crash recovery. The user would then need to either restore the broken WAL segment from backup, or override the check by e.g. setting recovery_target_record="invalid_record". (The default would be recovery_target_record="last_record". The name of the GUC tries to be consistent with existing recovery.conf settings, even though it affects crash recovery, not archive recovery.)

Corruption of fields which we require to scan past the record would cause false negatives, i.e. not trigger an error even though we do abort recovery mid-way through. There's a risk of false positives too, but they require quite specific orderings of writes and thus seem rather unlikely. (AFAICS, the OS would have to write some parts of record N followed by the whole of record N+1 and then crash to cause a false positive).

best regards,
Florian Pflug


#20Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#16)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 17:39:16 -0700, Jeff Davis wrote:

On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:

I don't think it's really slower. Earlier the code took WalInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be made
in the case we need to log. So I think we end up ahead of the current
state.

Good point.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

I was just a bit dense. No idea what I missed there.

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-checksums-Log-hint-bit-writes-in-a-concurrency-safe-.patch (text/x-patch; charset=us-ascii, +136 −106)
#21Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#20)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Florian Pflug (#19)
#23Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Andres Freund (#20)
#24Jeff Davis
pgsql@j-davis.com
In reply to: Jaime Casanova (#23)
#25Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Jeff Davis (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#21)
#27Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#20)
#28Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Jeff Janes (#27)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#26)
#30Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#29)
#31Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#26)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#26)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#32)
#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#30)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#34)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#35)
#37Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#35)
#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#37)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#41)
#43Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#40)
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#42)
#45Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#42)
#46Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#44)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#46)
#48Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#47)
#49Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#48)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#43)
#51Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#50)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#51)
#53Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#52)
#54Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeff Davis (#22)
#55Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#54)
#56Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeff Davis (#55)
#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Jim Nasby (#56)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#57)
#59Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#58)
#60Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#59)
#61Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#56)
#62Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#60)
#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#60)
#64Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#63)
#65Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#64)
#66Bruce Momjian
bruce@momjian.us
In reply to: Jeff Janes (#1)
#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#66)
#68Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#64)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#67)
#70Jeff Janes
jeff.janes@gmail.com
In reply to: Bruce Momjian (#66)
#71Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#70)
#72Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#68)
#73Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#68)
#74Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#66)
#75Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#72)
#76Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeff Davis (#61)
#77Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeff Janes (#70)
#78Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jim Nasby (#77)
In reply to: Jim Nasby (#77)
In reply to: Jon Nelson (#78)
#81Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Kenneth Marshall (#80)
#82Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#78)
#83Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#82)
#84Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#83)
#85Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#85)
#87Simon Riggs
simon@2ndQuadrant.com
In reply to: Jon Nelson (#83)