corrupt pages detected by enabling checksums

Started by Jeff Janes · almost 13 years ago · 87 messages
#1 Jeff Janes
jeff.janes@gmail.com

I've changed the subject from "regression test failed when enabling
checksum" because I now know they are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a
simple tweak I rendered the two issues orthogonal.

On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

What I would probably really want is the data as it existed after the
crash but before recovery started, but since the postmaster
immediately starts recovery after the crash, I don't know of a good
way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.

You don't know that the cluster is in the bad state until after it goes
through recovery because most crashes recover perfectly fine. So it would
have to make a side-copy of the cluster after the crash, then recover the
original and see how things go, then either retain or delete the side-copy.
Unfortunately my testing harness can't do this at the moment, because the
perl script storing the consistency info needs to survive over the crash
and recovery. It might take me awhile to figure out how to make it do
this.

I have the server log just go to stderr, where it gets mingled together
with messages from my testing harness. I had not uploaded that file.

Here is a new upload. It contains both a data_dir tarball including xlog,
and the mingled stderr ("do_new.out")

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing

The other 3 files in it constitute the harness. It is quite a mess, with
some hard-coded paths. The just-posted fix for wal_keep_segments will also
have to be applied.

The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:

Yes, that part is by my design. Why it didn't get fixed from a FPI is not
by my design, of course.

So, the page may be corrupt without checksums as well, but it just
happens to be hidden for the same reason. Can you try to reproduce it
without -k?

No, things run (seemingly) fine without -k.

And on the checkin right before checksums were added?
Without checksums, you'll need to use pg_filedump (or similar) to find
whether an error has happened.

I'll work on it, but it will take awhile to figure out exactly how to use
pg_filedump to do that, and how to automate that process and incorporate it
into the harness.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the
problem occurs there, so if the problem is not the checksums themselves
(and I agree it probably isn't) it must have been introduced before that
rather than after.

Cheers,

Jeff

#2 Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Janes (#1)
Re: corrupt pages detected by enabling checksums

On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:

I've changed the subject from "regression test failed when enabling
checksum" because I now know they are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a
simple tweak I rendered the two issues orthogonal.

On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

What I would probably really want is the data as it existed after the
crash but before recovery started, but since the postmaster
immediately starts recovery after the crash, I don't know of a good
way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.

You don't know that the cluster is in the bad state until after it goes
through recovery because most crashes recover perfectly fine. So it would
have to make a side-copy of the cluster after the crash, then recover the
original and see how things go, then either retain or delete the side-copy.
Unfortunately my testing harness can't do this at the moment, because the
perl script storing the consistency info needs to survive over the crash
and recovery. It might take me awhile to figure out how to make it do
this.

I have the server log just go to stderr, where it gets mingled together
with messages from my testing harness. I had not uploaded that file.

Here is a new upload. It contains both a data_dir tarball including xlog,
and the mingled stderr ("do_new.out")

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing

The other 3 files in it constitute the harness. It is quite a mess, with
some hard-coded paths. The just-posted fix for wal_keep_segments will also
have to be applied.

The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:

Yes, that part is by my design. Why it didn't get fixed from a FPI is not
by my design, of course.

There are no FPIs (if you mean a full page image with that) afaics:

Your logs tell us about recovery:
27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING: page verification failed, calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR: invalid page in block 90 of relation base/16384/835589

looking at xlog from that time, the first records are:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG len (rec/tot): 4/ 36, tx: 0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
rmgr: Heap len (rec/tot): 37/ 7905, tx: 26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, so no full pages need
to be logged. And we theoretically should be able to recover all things
like torn pages from the WAL since that should cover everything that
happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in unless I
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 01:52:41 +0200, Andres Freund wrote:

There are no FPIs (if you mean a full page image with that) afaics:

Your logs tell us about recovery:
27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING: page verification failed, calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR: invalid page in block 90 of relation base/16384/835589

looking at xlog from that time, the first records are:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG len (rec/tot): 4/ 36, tx: 0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
rmgr: Heap len (rec/tot): 37/ 7905, tx: 26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, so no full pages need
to be logged. And we theoretically should be able to recover all things
like torn pages from the WAL since that should cover everything that
happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in unless I
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without an LSN that
have an LSN that's *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Janes (#1)
1 attachment(s)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-04-03 at 15:57 -0700, Jeff Janes wrote:

You don't know that the cluster is in the bad state until after it
goes through recovery because most crashes recover perfectly fine. So
it would have to make a side-copy of the cluster after the crash, then
recover the original and see how things go, then either retain or
delete the side-copy. Unfortunately my testing harness can't do this
at the moment, because the perl script storing the consistency info
needs to survive over the crash and recovery. It might take
me awhile to figure out how to make it do this.

Hmm... you're right, that is difficult to integrate into a test.

Another approach is to expand PageHeaderIsValid in a pre-checksums
version to do some extra checks (perhaps the same checks as pg_filedump
does). It might be able to at least detect simple cases, like all zeros
between pd_upper and pd_special (when pd_upper < pd_special). Still not
easy, but maybe more practical than saving the directory like that. We
may even want a GUC for that to aid future debugging (obviously it would
have a performance impact).

The other 3 files in it constitute the harness. It is a quite a mess,
with some hard-coded paths. The just-posted fix for wal_keep_segments
will also have to be applied.

I'm still trying to reproduce the problem myself. I keep trying simpler
versions of your test and I don't see the problem. It's a little awkward
to try to run the full version of your test (and I'm jumping between
code inspection and various other ideas).

I'll work on it, but it will take awhile to figure out exactly how to
use pg_filedump to do that, and how to automate that process and
incorporate it into the harness.

It might be more productive to try to expand PageHeaderIsValid as I
suggested above.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and
the problem occurs there, so if the problem is not the checksums
themselves (and I agree it probably isn't) it must have been
introduced before that rather than after.

Thank you for all of your work on this. I have also attached another
test patch that disables most of the complex areas of the checksums
patch (VM bits, hint bits, etc.). If you apply different portions of the
patch (or the whole thing), it will help eliminate possibilities. In
particular, eliminating the calls to PageSetAllVisible+visibilitymap_set
could tell us a lot.

Also, the first data directory you posted had a problem with a page that
appeared to be a new page (otherwise I would have expected either old or
new data at the end). Do you think this might be a problem only for new
pages? Or do you think your test just makes it likely that many writes
will happen on new pages?

I did find one potential problem in the checksums code (aside from my
previous patch involving wal_level=archive): visibilitymap_set is called
sometimes before the heap page is marked dirty. I will examine a little
more closely, but I expect that to require another patch. Not sure if
that could explain this problem or not.

Regards,
Jeff Davis

Attachments:

checksums-check.patch (text/x-patch; charset=ISO-8859-1)
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 257,264 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		 * point in repeating the prune/defrag process until something else
  		 * happens to the page.
  		 */
! 		if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
! 			PageIsFull(page))
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
--- 257,264 ----
  		 * point in repeating the prune/defrag process until something else
  		 * happens to the page.
  		 */
! 		if (false && (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
! 					  PageIsFull(page)))
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 668,674 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  			freespace = PageGetHeapFreeSpace(page);
  
  			/* empty pages are always all-visible */
! 			if (!PageIsAllVisible(page))
  			{
  				PageSetAllVisible(page);
  				MarkBufferDirty(buf);
--- 668,674 ----
  			freespace = PageGetHeapFreeSpace(page);
  
  			/* empty pages are always all-visible */
! 			if (false && !PageIsAllVisible(page))
  			{
  				PageSetAllVisible(page);
  				MarkBufferDirty(buf);
***************
*** 901,907 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible)
  		{
  			if (!PageIsAllVisible(page))
  			{
--- 901,907 ----
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (false && all_visible)
  		{
  			if (!PageIsAllVisible(page))
  			{
***************
*** 1149,1155 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  	 * Now that we have removed the dead tuples from the page, once again check
  	 * if the page has become all-visible.
  	 */
! 	if (!visibilitymap_test(onerel, blkno, vmbuffer) &&
  		heap_page_is_all_visible(buffer, &visibility_cutoff_xid))
  	{
  		Assert(BufferIsValid(*vmbuffer));
--- 1149,1155 ----
  	 * Now that we have removed the dead tuples from the page, once again check
  	 * if the page has become all-visible.
  	 */
! 	if (false && !visibilitymap_test(onerel, blkno, vmbuffer) &&
  		heap_page_is_all_visible(buffer, &visibility_cutoff_xid))
  	{
  		Assert(BufferIsValid(*vmbuffer));
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 111,116 **** static inline void
--- 111,118 ----
  SetHintBits(HeapTupleHeader tuple, Buffer buffer,
  			uint16 infomask, TransactionId xid)
  {
+ 	return;
+ 
  	if (TransactionIdIsValid(xid))
  	{
  		/* NB: xid must be known committed here! */
#5 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 02:28:32 +0200, Andres Freund wrote:

On 2013-04-04 01:52:41 +0200, Andres Freund wrote:

Your logs tell us about recovery:
27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without an LSN that
have an LSN that's *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too
early. Above the end is detected at 1/31169A38, but the next WAL file
contains valid data:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000032 -n 10
rmgr: Heap2 len (rec/tot): 26/ 58, tx: 0, lsn: 1/32000030, prev 1/31FFFFD8, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 122 remxid 26328926
rmgr: Heap len (rec/tot): 51/ 83, tx: 26328964, lsn: 1/32000070, prev 1/32000030, bkp: 0000, desc: update: rel 1663/16384/835589; tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
rmgr: Heap2 len (rec/tot): 30/ 62, tx: 0, lsn: 1/320000C8, prev 1/32000070, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 63 remxid 26328803
rmgr: Btree len (rec/tot): 34/ 66, tx: 26328964, lsn: 1/32000108, prev 1/320000C8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/14

That would explain exactly those symptoms, wouldn't it? What's causing
that - I am not sure yet.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: corrupt pages detected by enabling checksums

andres@anarazel.de (Andres Freund) writes:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without an LSN that
have an LSN that's *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

regards, tom lane


#7 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: corrupt pages detected by enabling checksums

On 2013-04-03 20:45:51 -0400, Tom Lane wrote:

andres@anarazel.de (Andres Freund) writes:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without an LSN that
have an LSN that's *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

Nope, that's from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Unfortunately we can't easily diagnose what happened at:
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
since the startup process wrote its end of recovery checkpoint there:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31169A68, prev 1/31169A38, bkp: 0000, desc: checkpoint: redo 1/31169A68; tli 1; prev tli 1; fpw true; xid 0/26254999; oid 843781; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown

Starting from a some blocks in that wal segments later:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000031 -s 1/3116c000 -n 10
first record is after 1/3116C000, at 1/3116D9D8, skipping over 6616 bytes
rmgr: Heap len (rec/tot): 51/ 83, tx: 26254999, lsn: 1/3116D9D8, prev 1/3116BA20, bkp: 0000, desc: update: rel 1663/16384/835589; tid 38/148 xmax 26254999 ; new tid 44/57 xmax 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 26254999, lsn: 1/3116DA30, prev 1/3116D9D8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/319
rmgr: Heap len (rec/tot): 51/ 83, tx: 26255000, lsn: 1/3116DA78, prev 1/3116DA30, bkp: 0000, desc: update: rel 1663/16384/835589; tid 19/214 xmax 26255000 ; new tid 44/58 xmax 0

the records continue again.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 02:58:43 +0200, Andres Freund wrote:

On 2013-04-03 20:45:51 -0400, Tom Lane wrote:

andres@anarazel.de (Andres Freund) writes:

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without an LSN that
have an LSN that's *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

Nope, that's from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Ok, I think I see the bug. And I think it's been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was erroneously detected, resulting in the following error:
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68

Note that without a *hardware* crash, end of WAL is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full page written changed while it was
inserted into the WAL which caused the error. A possibly scenario for
that would e.g. two concurrent calls to MarkBufferDirtyHint().
Note:
* 2. The caller might have only share-lock instead of exclusive-lock on the
* buffer's content lock.

if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
{
MyPgXact->delayChkpt = delayChkpt = true;
lsn = XLogSaveBufferForHint(buffer); /* PART 1 */
}
}

LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);

if (!(bufHdr->flags & BM_DIRTY))
{
dirtied = true; /* Means "will be dirtied by this action" */

/*
* Set the page LSN if we wrote a backup block. We aren't
* supposed to set this when only holding a share lock but
* as long as we serialise it somehow we're OK. We choose to
* set LSN while holding the buffer header lock, which causes
* any reader of an LSN who holds only a share lock to also
* obtain a buffer header lock before using PageGetLSN().
* Fortunately, thats not too many places.
*
* If checksums are enabled, you might think we should reset the
* checksum here. That will happen when the page is written
* sometime later in this checkpoint cycle.
*/
if (!XLogRecPtrIsInvalid(lsn))
PageSetLSN(page, lsn); /* PART 2 */
}

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
rdata[0].data = (char *) (&watermark);
rdata[0].len = sizeof(int);
rdata[0].next = &(rdata[1]);

rdata[1].data = NULL;
rdata[1].len = 0;
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;

return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that one backend grabs WALInsertLock first,
it computes a full page write for the buffer, including a CRC, and
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
All the while, the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock-protected parts of XLogInsert().

Only that it already has done that part:
/*
* Calculate CRC of the data, including all the backup blocks
*
* Note that the record header isn't added into the CRC initially since we
* don't know the prev-link yet. Thus, the CRC will represent the CRC of
* the whole record in the order: rdata, then backup blocks, then record
* header.
*/
INIT_CRC32(rdata_crc);
for (rdt = rdata; rdt != NULL; rdt = rdt->next)
COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:
while (write_len)
{
while (rdata->data == NULL)
rdata = rdata->next;

if (freespace > 0)
{
if (rdata->len > freespace)
{
memcpy(Insert->currpos, rdata->data, freespace);
rdata->data += freespace;
rdata->len -= freespace;
write_len -= freespace;
}
else
{
memcpy(Insert->currpos, rdata->data, rdata->len);
freespace -= rdata->len;
write_len -= rdata->len;
Insert->currpos += rdata->len;
rdata = rdata->next;
continue;
}
}

/* Use next buffer */
updrqst = AdvanceXLInsertBuffer(false);
curridx = Insert->curridx;
/* Mark page header to indicate this record continues on the page */
Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
Insert->currpage->xlp_rem_len = write_len;
freespace = INSERT_FREESPACE(Insert);
}

If by now the first backend has proceeded to PageSetLSN(), we are writing
different data to disk than what we computed the checksum of
before. Boom.
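The interleaving can be replayed deterministically in a toy model (toy page size, a stand-in checksum, and a flag standing in for the second backend's PageSetLSN(); none of this is PostgreSQL's actual API):

```c
#include <stdint.h>

#define PAGE_SZ 64  /* toy page size; real BLCKSZ is 8192 */

/* Toy checksum standing in for the WAL CRC. */
static uint32_t
page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (int i = 0; i < PAGE_SZ; i++)
        sum = sum * 31 + page[i];
    return sum;
}

/* Returns 1 if the data actually written still matches the checksum
 * computed earlier -- i.e. whether WAL replay would succeed. The
 * `lsn_update` flag plays the part of the other backend's PageSetLSN()
 * sneaking in between checksum computation and the write. */
int
write_after_checksum_is_consistent(int lsn_update)
{
    uint8_t  page[PAGE_SZ] = {0};
    uint32_t crc = page_checksum(page);   /* backend B checksums the page */

    if (lsn_update)
        page[0] = 0x42;                   /* backend A sets the LSN afterwards */

    /* The bytes now on their way to WAL: */
    return page_checksum(page) == crc;
}
```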

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#8)
1 attachment(s)
Re: corrupt pages detected by enabling checksums

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Well spotted. (I think you mean checksums patch).

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same buffer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).
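The copy-before-checksum idea can be illustrated with a self-contained toy model (toy page size and a stand-in checksum, not PostgreSQL's actual functions): because both the CRC and the logged bytes come from a private snapshot, a concurrent update to the shared page can no longer invalidate the record.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 64  /* toy stand-in for BLCKSZ */

static uint32_t
toy_checksum(const uint8_t *buf)
{
    uint32_t sum = 0;
    for (int i = 0; i < PAGE_SZ; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/* Snapshot the shared page first, then checksum and "log" only the
 * private copy, as XLogSaveBufferForHint() does in the patch. Returns 1
 * if the logged bytes still match the CRC, even when a concurrent
 * hint-bit/LSN update hits the shared page in between. */
int
log_hint_page(const uint8_t *shared_page, int concurrent_update)
{
    uint8_t shared[PAGE_SZ];
    uint8_t copy[PAGE_SZ];

    memcpy(shared, shared_page, PAGE_SZ);
    memcpy(copy, shared, PAGE_SZ);      /* the copy taken before XLogInsert() */

    uint32_t crc = toy_checksum(copy);

    if (concurrent_update)
        shared[0] ^= 0x01;              /* racing PageSetLSN() on the shared page */

    /* WAL gets `copy`, whose bytes still match `crc`. */
    return toy_checksum(copy) == crc;
}
```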

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

copy_before_XLOG_HINT.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2f91bc8..92f6800 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -518,6 +518,11 @@ static XLogCtlData *XLogCtl = NULL;
 static ControlFileData *ControlFile = NULL;
 
 /*
+ * Local bufferspace for use in XLOG_HINT records.
+ */
+char *HintBuffer;
+
+/*
  * Macros for managing XLogInsert state.  In most cases, the calling routine
  * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
  * so these are passed as parameters instead of being fetched via XLogCtl.
@@ -880,7 +885,10 @@ begin:;
 		info |= XLR_BKP_BLOCK(i);
 
 		bkpb = &(dtbuf_xlg[i]);
-		page = (char *) BufferGetBlock(dtbuf[i]);
+		if (isHint)
+			page = (char *) HintBuffer;
+		else
+			page = (char *) BufferGetBlock(dtbuf[i]);
 
 		rdt->next = &(dtbuf_rdt1[i]);
 		rdt = rdt->next;
@@ -7663,6 +7671,19 @@ XLogSaveBufferForHint(Buffer buffer)
 	int			watermark = XLOG_HINT_WATERMARK;
 
 	/*
+	 * Copy the buffer to a local area, so that we can calculate the WAL
+	 * CRC without worrying that we only have share lock on the buffer.
+	 */
+	if (HintBuffer == NULL)
+	{
+		HintBuffer = malloc(BLCKSZ);
+		if (HintBuffer == NULL)
+			elog(ERROR, "out of memory");
+		MemSet(HintBuffer, 0, BLCKSZ);
+	}
+	memcpy(HintBuffer, (char *) BufferGetBlock(buffer), BLCKSZ);
+
+	/*
 	 * Not allowed to have zero-length records, so use a small watermark
 	 */
 	rdata[0].data = (char *) (&watermark);
#10Andres Freund
andres@2ndquadrant.com
In reply to: Simon Riggs (#9)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Well spotted. (I think you mean checksums patch).

Heh, yes. I was slightly tired at that point ;)

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broke.

Well, there might be other bugs we can't see yet... But lets hope not.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same bufer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.

Unfortunately I find that approach unacceptably ugly. I don't think it's
ok to make an already *very* complicated function (XLogInsert()) in one
of the most complicated files (xlog.c) even more complicated. It
literally took me years to understand a large percentage of it.
I even think the XLOG_HINT escape hatch should be taken out and be
replaced by something outside of XLogInsert().

I don't think that approach would work in as few lines anyway, since you
need to pass it properly into XLogCheckBuffer() et al., which checks the
buffer tags and such - which it needs in order to compute the struct
BkpBlock. Also, we iterate over rdata->page for the CRC computation.

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
XLogRecData rdata[2];
BkpBlock bkp;

/*
* make sure no checkpoint can happen while we decide about logging
* something or not, since we don't hold any lock preventing that
* otherwise.
*/
Assert(MyPgXact->delayChkpt);

/* update RedoRecPtr */
GetRedoRecPtr();

/* setup phony rdata element */
rdata[0].buffer = buffer;
rdata[0].buffer_std = true; /* is that actually ok? */

/* force full pages on, otherwise checksums won't work? */
if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
{
char copied_buffer[BLCKSZ];

/*
* copy buffer so we don't have to worry about
* concurrent hint bit or lsn updates. We assume pd_lower/upper
* cannot be changed without an exclusive lock, so the contents
* of bkp are not racy.
*/
memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);

rdata[0].data = (char *) &bkp;
rdata[0].len = sizeof(BkpBlock);
rdata[0].buffer = InvalidBuffer;
rdata[0].buffer_std = false;
rdata[0].next = &(rdata[1]);

rdata[1].data = copied_buffer;
rdata[1].len = BLCKSZ;
rdata[1].buffer = InvalidBuffer;
rdata[1].buffer_std = false;
rdata[1].next = NULL;

recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

return recptr;
}

That should work?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#10)
Re: corrupt pages detected by enabling checksums

On 4 April 2013 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

Unfortunately I find that approach unacceptably ugly.

Yeh. If we can confirm its a fix we can discuss a cleaner patch and
that is much better.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#10)
Re: corrupt pages detected by enabling checksums

Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.
The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.
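The hole-skipping copy Jeff describes, and its inverse on replay, can be sketched standalone (toy BLCKSZ; the function names are illustrative, not PostgreSQL's):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 64  /* toy value; real PostgreSQL pages are 8192 bytes */

/* Copy a page image while omitting the unused "hole" between pd_lower and
 * pd_upper, the way backup blocks are stored in WAL. Returns the number
 * of bytes written to `dst`. */
size_t
copy_page_skip_hole(uint8_t *dst, const uint8_t *page,
                    size_t hole_offset, size_t hole_length)
{
    memcpy(dst, page, hole_offset);
    memcpy(dst + hole_offset,
           page + hole_offset + hole_length,
           BLCKSZ - (hole_offset + hole_length));
    return BLCKSZ - hole_length;
}

/* Inverse: re-expand the compressed image, zero-filling the hole, as
 * the backup-block restore code does on replay. */
void
restore_page_fill_hole(uint8_t *page, const uint8_t *blk,
                       size_t hole_offset, size_t hole_length)
{
    memcpy(page, blk, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + hole_offset + hole_length,
           blk + hole_offset,
           BLCKSZ - (hole_offset + hole_length));
}

/* Self-check: compress and re-expand a patterned page; everything outside
 * the hole must survive, and the hole must come back zeroed. */
int
hole_roundtrip_ok(void)
{
    uint8_t page[BLCKSZ], blk[BLCKSZ], restored[BLCKSZ];
    size_t  off = 16, len = 24;

    for (int i = 0; i < BLCKSZ; i++)
        page[i] = (uint8_t) (i + 1);

    size_t n = copy_page_skip_hole(blk, page, off, len);
    restore_page_fill_hole(restored, blk, off, len);

    for (size_t i = 0; i < BLCKSZ; i++)
    {
        uint8_t want = (i >= off && i < off + len) ? 0 : page[i];
        if (restored[i] != want)
            return 0;
    }
    return n == BLCKSZ - len;
}
```

The WAL-size saving is exactly hole_length bytes per logged page, which is the trade-off being weighed against the simpler full-BLCKSZ memcpy.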

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.

Regards,
Jeff Davis


#13Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Davis (#12)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 12:59:36 -0700, Jeff Davis wrote:

Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing; I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.

I don't think it's really slower. Earlier the code took WALInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be taken
in the case we need to log. So I think we end up ahead of the current
state.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage in not copying the hole. I'd guess the simpler
code ends up faster.

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.

That sounds like it would be prone to deadlocks, so I'd rather not
go there.

Will write up a patch tomorrow.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#9)
Re: corrupt pages detected by enabling checksums

On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broke.

This brings up a pretty frightening possibility to me, unrelated to data
checksums. If a bit gets twiddled in the WAL file due to a hardware issue
or a "cosmic ray", and then a crash happens, automatic recovery will stop
early with the failed WAL checksum with an innocuous looking message. The
system will start up but will be invisibly inconsistent, and will proceed
to overwrite that portion of the WAL file which contains the old data (real
data that would have been necessary to reconstruct, once the corruption is
finally realized) with an end-of-recovery checkpoint record, and continue
to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.
Changing timelines upon ordinary crash recovery, not just media recovery,
seems excessive but also seems to be exactly what timelines were invented
for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still
think this would cause incorrect results, but...". I didn't realize the
significance of that until now. It does produce incorrect query results.
I was just bailing out before detecting them. Once I specify
ignore_checksum_failure=1
my test harness complains bitterly about the data not being consistent with
what the Perl program knows it is supposed to be.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure lets solve the
problem first then discuss which approach is best.

I've tested your patch it and it seems to do the job. It has survived far
longer than unpatched ever did, with neither checksum warnings, nor
complaints of inconsistent data. (I haven't analyzed the code much, just
the results, and leave the discussion of the best approach to others)

Thanks,

Jeff

#15Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Janes (#14)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

This brings up a pretty frightening possibility to me, unrelated to
data checksums. If a bit gets twiddled in the WAL file due to a
hardware issue or a "cosmic ray", and then a crash happens, automatic
recovery will stop early with the failed WAL checksum with
an innocuous looking message. The system will start up but will be
invisibly inconsistent, and will proceed to overwrite that portion of
the WAL file which contains the old data (real data that would have
been necessary to reconstruct, once the corruption is finally realized
) with an end-of-recovery checkpoint record and continue to chew up
real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.
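The classification step of that idea is simple to state in code. This is a hypothetical sketch of the proposal, not anything in PostgreSQL: the side-channel LSN would be durably stored after each WAL flush, and recovery would compare the failure location against it.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL location, as in PostgreSQL */

/* Classify a WAL CRC failure found at `failure_lsn` during recovery,
 * given `side_flushed_lsn`, the flushed-up-to LSN that was durably
 * recorded in a second location (SSD/NVRAM) after each WAL flush.
 *
 * Returns 1 for genuine corruption (we know valid WAL existed past the
 * failure, so recovery must not silently stop there), 0 for an ordinary
 * end of WAL (the flush never made it that far). */
int
crc_failure_is_corruption(XLogRecPtr failure_lsn, XLogRecPtr side_flushed_lsn)
{
    return failure_lsn < side_flushed_lsn;
}
```

The cost concern in the paragraph above is that keeping `side_flushed_lsn` durable requires a second flush per WAL flush, which is why it is floated only as an option.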

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Regards,
Jeff Davis


#16Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#13)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:

I don't think it's really slower. Earlier the code took WALInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be taken
in the case we need to log. So I think we end up ahead of the current
state.

Good point.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

Regards,
Jeff Davis


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#15)
Re: corrupt pages detected by enabling checksums

Jeff Davis <pgsql@j-davis.com> writes:

On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

This brings up a pretty frightening possibility to me, unrelated to
data checksums. If a bit gets twiddled in the WAL file due to a
hardware issue or a "cosmic ray", and then a crash happens, automatic
recovery will stop early with the failed WAL checksum with
an innocuous looking message. The system will start up but will be
invisibly inconsistent, and will proceed to overwrite that portion of
the WAL file which contains the old data (real data that would have
been necessary to reconstruct, once the corruption is finally realized
) with an end-of-recovery checkpoint record and continue to chew up
real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without writing flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

At least on traditional rotating media, this is only going to perform
well if you dedicate two drives to the purpose. At which point you
might as well just say "let's write two copies of WAL". Or maybe three
copies, so that you can take a vote when they disagree. While this is
not so unreasonable on its face for ultra-high-reliability requirements,
I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the
redundant copies to be on other machines. So the questions that that
leads to are how smart is our SR code about dealing with a master that
tries to re-send WAL regions it already sent, and what a slave ought to
do in such a situation if the new data doesn't match.

regards, tom lane


#18Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#17)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote:

I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

The goal would be to reliably detect a situation where WAL that has been
flushed successfully was later corrupted; not necessarily to recover
when we find it. Unless it's something like ZFS, I don't think most
software RAID will reliably detect corruption.

A side benefit is that it would also help catch bugs like this in the
future.

I'm not advocating for this particular solution, but I do concur with
Jeff Janes that it's a little bit scary and that we can entertain some
ideas about how to mitigate it.

Regards,
Jeff Davis


#19Florian Pflug
fgp@phlo.org
In reply to: Jeff Janes (#14)
Re: corrupt pages detected by enabling checksums

On Apr4, 2013, at 23:21 , Jeff Janes <jeff.janes@gmail.com> wrote:

This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early with the failed WAL checksum with an innocuous looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized ) with an end-of-recovery checkpoint record and continue to chew up real data from there.

Maybe we could scan forward to check whether a corrupted WAL record is followed by one or more valid ones with sensible LSNs. If it is, chances are high that we haven't actually hit the end of the WAL. In that case, we could either log a warning, or (better, probably) abort crash recovery. The user would then need to either restore the broken WAL segment from backup, or override the check by e.g. setting recovery_target_record="invalid_record". (The default would be recovery_target_record="last_record". The name of the GUC tries to be consistent with existing recovery.conf settings, even though it affects crash recovery, not archive recovery.)

Corruption of fields which we require to scan past the record would cause false negatives, i.e. not trigger an error even though recovery did abort mid-way through. There's a risk of false positives too, but they require quite specific orderings of writes and thus seem rather unlikely. (AFAICS, the OS would have to write some parts of record N followed by the whole of record N+1 and then crash to cause a false positive.)
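Florian's heuristic reduces to a forward scan over record validity. A toy sketch (each record is reduced to a single CRC-verifies flag; this is an illustration of the proposal, not real xlog reader code):

```c
#include <stddef.h>

/* Florian's heuristic: a record that fails its CRC but is followed by one
 * or more records that still verify is probably mid-WAL corruption, not
 * the true end of WAL. Returns 1 if recovery should abort/warn, 0 if the
 * failure looks like an ordinary end of WAL. */
int
looks_like_midwal_corruption(const int *crc_ok, size_t nrecords)
{
    for (size_t i = 0; i < nrecords; i++)
    {
        if (!crc_ok[i])
        {
            /* Scan forward: any later record that still verifies means
             * WAL continued past the failure. */
            for (size_t j = i + 1; j < nrecords; j++)
                if (crc_ok[j])
                    return 1;   /* abort recovery / warn loudly */
            return 0;           /* plain end of WAL */
        }
    }
    return 0;                   /* no failure at all */
}
```

The false-negative case Florian mentions corresponds to corruption that also destroys the length/link fields needed to find the *next* record, so the forward scan cannot even reach the valid records that follow.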

best regards,
Florian Pflug


#20Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Davis (#16)
1 attachment(s)
Re: corrupt pages detected by enabling checksums

On 2013-04-04 17:39:16 -0700, Jeff Davis wrote:

On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:

I don't think its really slower. Earlier the code took WalInsertLock
everytime, even if we ended up not logging anything. Thats far more
epensive than a single spinlock. And the copy should also only be taken
in the case we need to log. So I think we end up ahead of the current
state.

Good point.

The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

I was just a bit dense. No idea what I missed there.

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-checksums-Log-hint-bit-writes-in-a-concurrency-safe-.patch (text/x-patch; charset=us-ascii)
From 3c086daffb994665b23bc65a1017cb71abef17bf Mon Sep 17 00:00:01 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 5 Apr 2013 15:02:35 +0200
Subject: [PATCH] checksums: Log hint bit writes in a concurrency safe manner

The previous manner was racy since share locked pages - which
XLogSaveBufferForHint() gets passed - may still be modified. Namely further
hint bit updates and PageSetLSN() can be performed. That could lead to checksum
failures in WAL since the page changed between checksum computation and the
actual write to WAL.

Instead, and only if required, write a copy of the page as a normal
record. struct BkpBlock is still used so we can use most of the usual backup
block restoration code.
---
 src/backend/access/rmgrdesc/xlogdesc.c |    6 +-
 src/backend/access/transam/xlog.c      |  233 ++++++++++++++++++--------------
 src/include/catalog/catversion.h       |    2 +-
 3 files changed, 136 insertions(+), 105 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 52cf759..4c68b6a 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -17,6 +17,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_control.h"
+#include "common/relpath.h"
 #include "utils/guc.h"
 #include "utils/timestamp.h"
 
@@ -83,7 +84,10 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
 	}
 	else if (info == XLOG_HINT)
 	{
-		appendStringInfo(buf, "page hint");
+		BkpBlock *bkp = (BkpBlock *) rec;
+		appendStringInfo(buf, "page hint: %s block %u",
+						 relpathperm(bkp->node, bkp->fork),
+						 bkp->block);
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a9462e..3e59f94 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -648,6 +648,8 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 
 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 				XLogRecPtr *lsn, BkpBlock *bkpb);
+static Buffer RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb,
+				char *blk, bool get_cleanup_lock, bool keep_buffer);
 static bool AdvanceXLInsertBuffer(bool new_segment);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
@@ -731,7 +733,6 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
-	bool		isHint = (rmid == RM_XLOG_ID && info == XLOG_HINT);
 	uint8		info_orig = info;
 	static XLogRecord *rechdr;
 
@@ -1002,18 +1003,6 @@ begin:;
 	}
 
 	/*
-	 * If this is a hint record and we don't need a backup block then
-	 * we have no more work to do and can exit quickly without inserting
-	 * a WAL record at all. In that case return InvalidXLogRecPtr.
-	 */
-	if (isHint && !(info & XLR_BKP_BLOCK_MASK))
-	{
-		LWLockRelease(WALInsertLock);
-		END_CRIT_SECTION();
-		return InvalidXLogRecPtr;
-	}
-
-	/*
 	 * If the current page is completely full, the record goes to the next
 	 * page, right after the page header.
 	 */
@@ -3158,8 +3147,6 @@ Buffer
 RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
 				   bool get_cleanup_lock, bool keep_buffer)
 {
-	Buffer		buffer;
-	Page		page;
 	BkpBlock	bkpb;
 	char	   *blk;
 	int			i;
@@ -3177,42 +3164,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
 		if (i == block_index)
 		{
 			/* Found it, apply the update */
-			buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
-											RBM_ZERO);
-			Assert(BufferIsValid(buffer));
-			if (get_cleanup_lock)
-				LockBufferForCleanup(buffer);
-			else
-				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
-			page = (Page) BufferGetPage(buffer);
-
-			if (bkpb.hole_length == 0)
-			{
-				memcpy((char *) page, blk, BLCKSZ);
-			}
-			else
-			{
-				memcpy((char *) page, blk, bkpb.hole_offset);
-				/* must zero-fill the hole */
-				MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
-				memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
-					   blk + bkpb.hole_offset,
-					   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
-			}
-
-			/*
-			 * Any checksum set on this page will be invalid. We don't need
-			 * to reset it here since it will be set before being written.
-			 */
-
-			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
-
-			if (!keep_buffer)
-				UnlockReleaseBuffer(buffer);
-
-			return buffer;
+			return RestoreBackupBlockContents(lsn, bkpb, blk, get_cleanup_lock,
+											  keep_buffer);
 		}
 
 		blk += BLCKSZ - bkpb.hole_length;
@@ -3224,6 +3177,56 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
 }
 
 /*
+ * Workhorse for RestoreBackupBlock usable without an xlog record
+ *
+ * Restores a full-page image from BkpBlock and a data pointer.
+ */
+static Buffer
+RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
+						   bool get_cleanup_lock, bool keep_buffer)
+{
+	Buffer		buffer;
+	Page		page;
+
+	buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
+									RBM_ZERO);
+	Assert(BufferIsValid(buffer));
+	if (get_cleanup_lock)
+		LockBufferForCleanup(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	page = (Page) BufferGetPage(buffer);
+
+	if (bkpb.hole_length == 0)
+	{
+		memcpy((char *) page, blk, BLCKSZ);
+	}
+	else
+	{
+		memcpy((char *) page, blk, bkpb.hole_offset);
+		/* must zero-fill the hole */
+		MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
+		memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
+			   blk + bkpb.hole_offset,
+			   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+	}
+
+	/*
+	 * Any checksum set on this page will be invalid. We don't need
+	 * to reset it here since it will be set before being written.
+	 */
+
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(buffer);
+
+	if (!keep_buffer)
+		UnlockReleaseBuffer(buffer);
+
+	return buffer;
+}
+
+/*
  * Attempt to read an XLOG record.
  *
  * If RecPtr is not NULL, try to read a record at that position.  Otherwise
@@ -7638,45 +7641,74 @@ XLogRestorePoint(const char *rpName)
  * Write a backup block if needed when we are setting a hint. Note that
  * this may be called for a variety of page types, not just heaps.
  *
- * Deciding the "if needed" part is delicate and requires us to either
- * grab WALInsertLock or check the info_lck spinlock. If we check the
- * spinlock and it says Yes then we will need to get WALInsertLock as well,
- * so the design choice here is to just go straight for the WALInsertLock
- * and trust that calls to this function are minimised elsewhere.
- *
  * Callable while holding just share lock on the buffer content.
  *
- * Possible that multiple concurrent backends could attempt to write
- * WAL records. In that case, more than one backup block may be recorded
- * though that isn't important to the outcome and the backup blocks are
- * likely to be identical anyway.
+ * We can't use the plain backup block mechanism since that relies on the
+ * buffer being exclusively locked. Since some modifications (setting the LSN,
+ * hint bits) are allowed while holding only a share lock, that could lead to
+ * WAL checksum failures. So instead we copy the page and insert the copied
+ * data as normal record data.
+ *
+ * We only need to do something if the page has not yet been full-page
+ * written in this checkpoint round. The LSN of the inserted WAL record is
+ * returned if we had to write, InvalidXLogRecPtr otherwise.
+ *
+ * It is possible that multiple concurrent backends could attempt to write WAL
+ * records. In that case, more than one block image may be recorded, but that's
+ * ok from a correctness POV.
  */
-#define	XLOG_HINT_WATERMARK		13579
 XLogRecPtr
 XLogSaveBufferForHint(Buffer buffer)
 {
-	/*
-	 * Make an XLOG entry reporting the hint
-	 */
-	XLogRecData rdata[2];
-	int			watermark = XLOG_HINT_WATERMARK;
-
-	/*
-	 * Not allowed to have zero-length records, so use a small watermark
-	 */
-	rdata[0].data = (char *) (&watermark);
-	rdata[0].len = sizeof(int);
-	rdata[0].buffer = InvalidBuffer;
-	rdata[0].buffer_std = false;
-	rdata[0].next = &(rdata[1]);
-
-	rdata[1].data = NULL;
-	rdata[1].len = 0;
-	rdata[1].buffer = buffer;
-	rdata[1].buffer_std = true;
-	rdata[1].next = NULL;
-
-	return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
+	XLogRecPtr	recptr = InvalidXLogRecPtr;
+	XLogRecPtr	lsn;
+	XLogRecData rdata[2];
+	BkpBlock	bkpb;
+
+	/*
+	 * Ensure no checkpoint can happen while we decide about logging something
+	 * based on RedoRecPtr, since we don't hold any lock preventing that
+	 * otherwise.
+	 */
+	Assert(MyPgXact->delayChkpt);
+
+	/* update RedoRecPtr so XLogCheckBuffer can make the right decision */
+	GetRedoRecPtr();
+
+	/* set up a phony rdata element for XLogCheckBuffer's use */
+	rdata[0].buffer = buffer;
+	rdata[0].buffer_std = true; /* is that actually ok? E.g. fsm might not */
+
+	/* force full pages on, otherwise checksums won't work? */
+	if (XLogCheckBuffer(rdata, true, &lsn, &bkpb))
+	{
+		char		copied_buffer[BLCKSZ];
+		char	   *origdata = (char *) BufferGetBlock(buffer);
+
+		/*
+		 * Copy the buffer so we don't have to worry about concurrent hint bit
+		 * or LSN updates. We assume pd_lower/upper cannot be changed without
+		 * an exclusive lock, so the backed-up contents are not racy.
+		 */
+		memcpy(copied_buffer, origdata, bkpb.hole_offset);
+		memcpy(copied_buffer + bkpb.hole_offset,
+			   origdata + bkpb.hole_offset + bkpb.hole_length,
+			   BLCKSZ - bkpb.hole_offset - bkpb.hole_length);
+
+		rdata[0].data = (char *) &bkpb;
+		rdata[0].len = sizeof(BkpBlock);
+		rdata[0].buffer = InvalidBuffer;
+		rdata[0].next = &(rdata[1]);
+
+		rdata[1].data = copied_buffer;
+		rdata[1].len = BLCKSZ - bkpb.hole_length;
+		rdata[1].buffer = InvalidBuffer;
+		rdata[1].next = NULL;
+
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
+	}
+
+	return recptr;
 }
 
 /*
@@ -7845,8 +7877,8 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 {
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;
 
-	/* Backup blocks are not used in most xlog records */
-	Assert(info == XLOG_HINT || !(record->xl_info & XLR_BKP_BLOCK_MASK));
+	/* Backup blocks are not used in xlog xlog records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
 
 	if (info == XLOG_NEXTOID)
 	{
@@ -8041,31 +8073,26 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 	}
 	else if (info == XLOG_HINT)
 	{
-#ifdef USE_ASSERT_CHECKING
-		int	*watermark = (int *) XLogRecGetData(record);
-#endif
-
-		/* Check the watermark is correct for the hint record */
-		Assert(*watermark == XLOG_HINT_WATERMARK);
-
-		/* Backup blocks must be present for smgr hint records */
-		Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+		char *data;
+		BkpBlock bkpb;
 
 		/*
-		 * Hint records have no information that needs to be replayed.
-		 * The sole purpose of them is to ensure that a hint bit does
-		 * not cause a checksum invalidation if a hint bit write should
-		 * cause a torn page. So the body of the record is empty but
-		 * there must be one backup block.
+		 * Hint bit writes contain a backup block stored "inline" in the normal
+		 * data since the locking when writing HINT records isn't sufficient to
+		 * use the normal backup block mechanism.
 		 *
-		 * Since the only change in the backup block is a hint bit,
-		 * there is no confict with Hot Standby.
+		 * Since the only change in the backup block is a hint bit, there is no
+		 * conflict with Hot Standby.
 		 *
 		 * This also means there is no corresponding API call for this,
 		 * so an smgr implementation has no need to implement anything.
 		 * Which means nothing is needed in md.c etc
 		 */
-		RestoreBackupBlock(lsn, record, 0, false, false);
+		data = XLogRecGetData(record);
+		memcpy(&bkpb, XLogRecGetData(record), sizeof(BkpBlock));
+		data += sizeof(BkpBlock);
+
+		RestoreBackupBlockContents(lsn, bkpb, data, false, false);
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index aa8b715..8d05c61 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201303291
+#define CATALOG_VERSION_NO	201304051
 
 #endif
-- 
1.7.10.4
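For readers following along: the "hole" handling that both RestoreBackupBlockContents and XLogSaveBufferForHint rely on — skipping the all-zero gap between pd_lower and pd_upper when a block image is written, and zero-filling it again on restore — can be sketched standalone roughly like this (illustrative names only, not the actual patch code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Write a page image minus the hole into 'out'; returns the number of
 * bytes written (BLCKSZ - hole_length). The hole [hole_offset,
 * hole_offset + hole_length) is assumed to be all zeroes, as it is on a
 * standard page, so no information is lost by skipping it.
 */
static size_t
compress_hole(const char *page, size_t hole_offset, size_t hole_length,
			  char *out)
{
	memcpy(out, page, hole_offset);
	memcpy(out + hole_offset,
		   page + hole_offset + hole_length,
		   BLCKSZ - hole_offset - hole_length);
	return BLCKSZ - hole_length;
}

/*
 * Reconstruct a full page from the compressed image, zero-filling the
 * hole — the same copy pattern RestoreBackupBlockContents uses.
 */
static void
restore_hole(const char *blk, size_t hole_offset, size_t hole_length,
			 char *page)
{
	memcpy(page, blk, hole_offset);
	memset(page + hole_offset, 0, hole_length);
	memcpy(page + hole_offset + hole_length,
		   blk + hole_offset,
		   BLCKSZ - hole_offset - hole_length);
}
```

Round-tripping a page through compress_hole and restore_hole yields the original page exactly, provided the hole really was zeroed beforehand.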

#21Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#20)
Re: corrupt pages detected by enabling checksums

On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

Comments:

* Regarding full page writes, we can:
- always write full pages (as in your current patch), regardless of
the current settings
- take WALInsertLock briefly to get the current settings from XLogCtl
- always calculate the full page record, but in XLogInsert, if it
happens to be a HINT record, and full page writes are not necessary,
then discard it (right now I somewhat favor this option but I haven't
looked at the details)

* typo in "Backup blocks are not used in **xlog xlog** records"

* To get the appropriate setting for buffer_std, we should pass it down
through MarkBufferDirty as an extra flag, I think.

* I'm a bit worried that we'll need a cleanup lock when restoring an FSM
page. What do you think?

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
Merely a matter of preference but I thought I would mention it.

Jeff, any chance you can run this for a round with your suite?

Yes. I don't have a rigorous test suite, but I'll do some manual tests
and walk through it with gdb.

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Jeff Davis
pgsql@j-davis.com
In reply to: Florian Pflug (#19)
Re: corrupt pages detected by enabling checksums

On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:

Maybe we could scan forward to check whether a corrupted WAL record is
followed by one or more valid ones with sensible LSNs. If it is,
chances are high that we haven't actually hit the end of the WAL. In
that case, we could either log a warning, or (better, probably) abort
crash recovery.

+1.

Corruption of fields which we require to scan past the record would
cause false negatives, i.e. not trigger an error even though we do
abort recovery mid-way through. There's a risk of false positives too,
but they require quite specific orderings of writes and thus seem
rather unlikely. (AFAICS, the OS would have to write some parts of
record N followed by the whole of record N+1 and then crash to cause a
false positive).

Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL
data makes it to disk, which could cause a false positive and may be a
fairly common case.

Regards,
Jeff Davis


#23Jaime Casanova
jaime@2ndquadrant.com
In reply to: Andres Freund (#20)
1 attachment(s)
Re: corrupt pages detected by enabling checksums

On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

I still got errors when executing make installcheck in a just compiled
9.3devel + this_patch, this is when setting wal_level higher than
minimal.
Attached regression.diffs

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
Phone: +593 4 5107566 Cell: +593 987171157

Attachments:

regression.diffs (application/octet-stream)
*** /home/postgres/pgreleases/postgresql/src/test/regress/expected/tablespace.out	2013-04-05 19:01:06.000000000 -0500
--- /home/postgres/pgreleases/postgresql/src/test/regress/results/tablespace.out	2013-04-05 19:01:19.000000000 -0500
***************
*** 54,68 ****
  ALTER TABLE testschema.atable SET TABLESPACE testspace;
  ALTER INDEX testschema.anindex SET TABLESPACE testspace;
  INSERT INTO testschema.atable VALUES(3);	-- ok
  INSERT INTO testschema.atable VALUES(1);	-- fail (checks index)
! ERROR:  duplicate key value violates unique constraint "anindex"
! DETAIL:  Key (column1)=(1) already exists.
  SELECT COUNT(*) FROM testschema.atable;		-- checks heap
!  count 
! -------
!      3
! (1 row)
! 
  -- Will fail with bad path
  CREATE TABLESPACE badspace LOCATION '/no/such/location';
  ERROR:  directory "/no/such/location" does not exist
--- 54,67 ----
  ALTER TABLE testschema.atable SET TABLESPACE testspace;
  ALTER INDEX testschema.anindex SET TABLESPACE testspace;
  INSERT INTO testschema.atable VALUES(3);	-- ok
+ WARNING:  page verification failed, calculated checksum 24379 but expected 47546
+ ERROR:  invalid page in block 0 of relation pg_tblspc/16385/PG_9.3_201304051/16384/16401
  INSERT INTO testschema.atable VALUES(1);	-- fail (checks index)
! WARNING:  page verification failed, calculated checksum 24379 but expected 47546
! ERROR:  invalid page in block 0 of relation pg_tblspc/16385/PG_9.3_201304051/16384/16401
  SELECT COUNT(*) FROM testschema.atable;		-- checks heap
! WARNING:  page verification failed, calculated checksum 3350 but expected 53384
! ERROR:  invalid page in block 0 of relation pg_tblspc/16385/PG_9.3_201304051/16384/16402
  -- Will fail with bad path
  CREATE TABLESPACE badspace LOCATION '/no/such/location';
  ERROR:  directory "/no/such/location" does not exist

======================================================================

*** /home/postgres/pgreleases/postgresql/src/test/regress/expected/vacuum.out	2013-04-04 11:20:41.000000000 -0500
--- /home/postgres/pgreleases/postgresql/src/test/regress/results/vacuum.out	2013-04-05 19:02:24.000000000 -0500
***************
*** 30,71 ****
  
  VACUUM FULL vactst;
  UPDATE vactst SET i = i + 1;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst SELECT * FROM vactst;
  INSERT INTO vactst VALUES (0);
  SELECT count(*) FROM vactst;
!  count 
! -------
!   2049
! (1 row)
! 
  DELETE FROM vactst WHERE i != 0;
  VACUUM (FULL) vactst;
  DELETE FROM vactst;
  SELECT * FROM vactst;
!  i 
! ---
! (0 rows)
! 
  VACUUM (FULL, FREEZE) vactst;
  VACUUM (ANALYZE, FULL) vactst;
  CREATE TABLE vaccluster (i INT PRIMARY KEY);
  ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
  INSERT INTO vaccluster SELECT * FROM vactst;
  CLUSTER vaccluster;
  VACUUM FULL pg_am;
  VACUUM FULL pg_class;
  VACUUM FULL pg_database;
  VACUUM FULL vaccluster;
  VACUUM FULL vactst;
  DROP TABLE vaccluster;
  DROP TABLE vactst;
--- 30,113 ----
  
  VACUUM FULL vactst;
  UPDATE vactst SET i = i + 1;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  INSERT INTO vactst VALUES (0);
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  SELECT count(*) FROM vactst;
! WARNING:  page verification failed, calculated checksum 9180 but expected 43354
! ERROR:  invalid page in block 0 of relation base/16384/28643
  DELETE FROM vactst WHERE i != 0;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  VACUUM (FULL) vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  DELETE FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  SELECT * FROM vactst;
! WARNING:  page verification failed, calculated checksum 9180 but expected 43354
! ERROR:  invalid page in block 0 of relation base/16384/28643
  VACUUM (FULL, FREEZE) vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  VACUUM (ANALYZE, FULL) vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  CREATE TABLE vaccluster (i INT PRIMARY KEY);
  ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
  INSERT INTO vaccluster SELECT * FROM vactst;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  CLUSTER vaccluster;
  VACUUM FULL pg_am;
+ WARNING:  page verification failed, calculated checksum 37138 but expected 6194
+ ERROR:  invalid page in block 0 of relation base/16384/28664
  VACUUM FULL pg_class;
+ WARNING:  page verification failed, calculated checksum 12202 but expected 40441
+ ERROR:  invalid page in block 0 of relation base/16384/28667
  VACUUM FULL pg_database;
+ ERROR:  could not read block 4 in file "base/16384/28667": read only 0 of 8192 bytes
  VACUUM FULL vaccluster;
+ ERROR:  could not read block 15 in file "base/16384/28667": read only 0 of 8192 bytes
  VACUUM FULL vactst;
+ ERROR:  could not read block 15 in file "base/16384/28667": read only 0 of 8192 bytes
  DROP TABLE vaccluster;
+ ERROR:  could not read block 15 in file "base/16384/28667": read only 0 of 8192 bytes
  DROP TABLE vactst;
+ ERROR:  could not read block 15 in file "base/16384/28667": read only 0 of 8192 bytes

======================================================================

*** /home/postgres/pgreleases/postgresql/src/test/regress/expected/sanity_check.out	2013-04-04 11:20:41.000000000 -0500
--- /home/postgres/pgreleases/postgresql/src/test/regress/results/sanity_check.out	2013-04-05 19:02:30.000000000 -0500
***************
*** 1,4 ****
--- 1,6 ----
  VACUUM;
+ WARNING:  page verification failed, calculated checksum 9180 but expected 43354
+ ERROR:  invalid page in block 0 of relation base/16384/28643
  --
  -- sanity check, if we don't have indices the test will take years to
  -- complete.  But skip TOAST relations (since they will have varying
***************
*** 165,172 ****
   timestamptz_tbl         | f
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (155 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 167,176 ----
   timestamptz_tbl         | f
   timetz_tbl              | f
   tinterval_tbl           | f
+  vaccluster              | t
+  vactst                  | f
   varchar_tbl             | f
! (157 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have

======================================================================

*** /home/postgres/pgreleases/postgresql/src/test/regress/expected/misc.out	2013-04-05 19:01:06.000000000 -0500
--- /home/postgres/pgreleases/postgresql/src/test/regress/results/misc.out	2013-04-05 19:02:55.000000000 -0500
***************
*** 694,702 ****
   tvv
   tvvm
   tvvmv
   varchar_tbl
   xacttest
! (118 rows)
  
  SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
   name 
--- 694,704 ----
   tvv
   tvvm
   tvvmv
+  vaccluster
+  vactst
   varchar_tbl
   xacttest
! (120 rows)
  
  SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
   name 

======================================================================

*** /home/postgres/pgreleases/postgresql/src/test/regress/expected/cluster.out	2013-04-04 11:20:41.000000000 -0500
--- /home/postgres/pgreleases/postgresql/src/test/regress/results/cluster.out	2013-04-05 19:03:20.000000000 -0500
***************
*** 53,93 ****
  -- This entry is needed to test that TOASTED values are copied correctly.
  INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
  CLUSTER clstr_tst_c ON clstr_tst;
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
   a  | b  |       c       |           substring            | length 
  ----+----+---------------+--------------------------------+--------
-  10 | 14 | catorce       |                                |       
-  18 |  5 | cinco         |                                |       
-   9 |  4 | cuatro        |                                |       
-  26 | 19 | diecinueve    |                                |       
-  12 | 18 | dieciocho     |                                |       
-  30 | 16 | dieciseis     |                                |       
-  24 | 17 | diecisiete    |                                |       
-   2 | 10 | diez          |                                |       
-  23 | 12 | doce          |                                |       
-  11 |  2 | dos           |                                |       
-  25 |  9 | nueve         |                                |       
-  31 |  8 | ocho          |                                |       
    1 | 11 | once          |                                |       
!  28 | 15 | quince        |                                |       
!  32 |  6 | seis          | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
!  29 |  7 | siete         |                                |       
!  15 | 13 | trece         |                                |       
!  22 | 30 | treinta       |                                |       
!  17 | 32 | treinta y dos |                                |       
    3 | 31 | treinta y uno |                                |       
    5 |  3 | tres          |                                |       
-  20 |  1 | uno           |                                |       
    6 | 20 | veinte        |                                |       
   14 | 25 | veinticinco   |                                |       
!  21 | 24 | veinticuatro  |                                |       
!   4 | 22 | veintidos     |                                |       
!  19 | 29 | veintinueve   |                                |       
   16 | 28 | veintiocho    |                                |       
   27 | 26 | veintiseis    |                                |       
!  13 | 27 | veintisiete   |                                |       
!   7 | 23 | veintitres    |                                |       
!   8 | 21 | veintiuno     |                                |       
  (32 rows)
  
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
--- 53,95 ----
  -- This entry is needed to test that TOASTED values are copied correctly.
  INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
  CLUSTER clstr_tst_c ON clstr_tst;
+ WARNING:  page verification failed, calculated checksum 2699 but expected 58989
+ ERROR:  invalid page in block 0 of relation base/16384/31148
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
   a  | b  |       c       |           substring            | length 
  ----+----+---------------+--------------------------------+--------
    1 | 11 | once          |                                |       
!   2 | 10 | diez          |                                |       
    3 | 31 | treinta y uno |                                |       
+   4 | 22 | veintidos     |                                |       
    5 |  3 | tres          |                                |       
    6 | 20 | veinte        |                                |       
+   7 | 23 | veintitres    |                                |       
+   8 | 21 | veintiuno     |                                |       
+   9 |  4 | cuatro        |                                |       
+  10 | 14 | catorce       |                                |       
+  11 |  2 | dos           |                                |       
+  12 | 18 | dieciocho     |                                |       
+  13 | 27 | veintisiete   |                                |       
   14 | 25 | veinticinco   |                                |       
!  15 | 13 | trece         |                                |       
   16 | 28 | veintiocho    |                                |       
+  17 | 32 | treinta y dos |                                |       
+  18 |  5 | cinco         |                                |       
+  19 | 29 | veintinueve   |                                |       
+  20 |  1 | uno           |                                |       
+  21 | 24 | veinticuatro  |                                |       
+  22 | 30 | treinta       |                                |       
+  23 | 12 | doce          |                                |       
+  24 | 17 | diecisiete    |                                |       
+  25 |  9 | nueve         |                                |       
+  26 | 19 | diecinueve    |                                |       
   27 | 26 | veintiseis    |                                |       
!  28 | 15 | quince        |                                |       
!  29 |  7 | siete         |                                |       
!  30 | 16 | dieciseis     |                                |       
!  31 |  8 | ocho          |                                |       
!  32 |  6 | seis          | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
  (32 rows)
  
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
***************
*** 206,243 ****
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
   a  |  b  |       c        |           substring            | length 
  ----+-----+----------------+--------------------------------+--------
-  10 |  14 | catorce        |                                |       
-  18 |   5 | cinco          |                                |       
-   9 |   4 | cuatro         |                                |       
-  26 |  19 | diecinueve     |                                |       
-  12 |  18 | dieciocho      |                                |       
-  30 |  16 | dieciseis      |                                |       
-  24 |  17 | diecisiete     |                                |       
-   2 |  10 | diez           |                                |       
-  23 |  12 | doce           |                                |       
-  11 |   2 | dos            |                                |       
-  25 |   9 | nueve          |                                |       
-  31 |   8 | ocho           |                                |       
    1 |  11 | once           |                                |       
!  28 |  15 | quince         |                                |       
!  32 |   6 | seis           | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
!  29 |   7 | siete          |                                |       
!  15 |  13 | trece          |                                |       
!  22 |  30 | treinta        |                                |       
!  17 |  32 | treinta y dos  |                                |       
    3 |  31 | treinta y uno  |                                |       
    5 |   3 | tres           |                                |       
-  20 |   1 | uno            |                                |       
    6 |  20 | veinte         |                                |       
   14 |  25 | veinticinco    |                                |       
!  21 |  24 | veinticuatro   |                                |       
!   4 |  22 | veintidos      |                                |       
!  19 |  29 | veintinueve    |                                |       
   16 |  28 | veintiocho     |                                |       
   27 |  26 | veintiseis     |                                |       
!  13 |  27 | veintisiete    |                                |       
!   7 |  23 | veintitres     |                                |       
!   8 |  21 | veintiuno      |                                |       
    0 | 100 | in child table |                                |       
  (33 rows)
  
--- 208,245 ----
  SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
   a  |  b  |       c        |           substring            | length 
  ----+-----+----------------+--------------------------------+--------
    1 |  11 | once           |                                |       
!   2 |  10 | diez           |                                |       
    3 |  31 | treinta y uno  |                                |       
+   4 |  22 | veintidos      |                                |       
    5 |   3 | tres           |                                |       
    6 |  20 | veinte         |                                |       
+   7 |  23 | veintitres     |                                |       
+   8 |  21 | veintiuno      |                                |       
+   9 |   4 | cuatro         |                                |       
+  10 |  14 | catorce        |                                |       
+  11 |   2 | dos            |                                |       
+  12 |  18 | dieciocho      |                                |       
+  13 |  27 | veintisiete    |                                |       
   14 |  25 | veinticinco    |                                |       
!  15 |  13 | trece          |                                |       
   16 |  28 | veintiocho     |                                |       
+  17 |  32 | treinta y dos  |                                |       
+  18 |   5 | cinco          |                                |       
+  19 |  29 | veintinueve    |                                |       
+  20 |   1 | uno            |                                |       
+  21 |  24 | veinticuatro   |                                |       
+  22 |  30 | treinta        |                                |       
+  23 |  12 | doce           |                                |       
+  24 |  17 | diecisiete     |                                |       
+  25 |   9 | nueve          |                                |       
+  26 |  19 | diecinueve     |                                |       
   27 |  26 | veintiseis     |                                |       
!  28 |  15 | quince         |                                |       
!  29 |   7 | siete          |                                |       
!  30 |  16 | dieciseis      |                                |       
!  31 |   8 | ocho           |                                |       
!  32 |   6 | seis           | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
    0 | 100 | in child table |                                |       
  (33 rows)
  
***************
*** 277,286 ****
  	AND indrelid=pg_class_2.oid
  	AND pg_class_2.relname = 'clstr_tst'
  	AND indisclustered;
!    relname   
! -------------
!  clstr_tst_c
! (1 row)
  
  -- Try changing indisclustered
  ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
--- 279,287 ----
  	AND indrelid=pg_class_2.oid
  	AND pg_class_2.relname = 'clstr_tst'
  	AND indisclustered;
!  relname 
! ---------
! (0 rows)
  
  -- Try changing indisclustered
  ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
***************
*** 323,338 ****
  CLUSTER clstr_2;
  ERROR:  there is no previously clustered index for table "clstr_2"
  CLUSTER clstr_1_pkey ON clstr_1;
  CLUSTER clstr_2 USING clstr_2_pkey;
  SELECT * FROM clstr_1 UNION ALL
    SELECT * FROM clstr_2 UNION ALL
    SELECT * FROM clstr_3;
   a 
  ---
-  1
   2
   1
   2
   2
   1
  (6 rows)
--- 324,343 ----
  CLUSTER clstr_2;
  ERROR:  there is no previously clustered index for table "clstr_2"
  CLUSTER clstr_1_pkey ON clstr_1;
+ WARNING:  page verification failed, calculated checksum 24058 but expected 20177
+ ERROR:  invalid page in block 0 of relation base/16384/31171
  CLUSTER clstr_2 USING clstr_2_pkey;
+ WARNING:  page verification failed, calculated checksum 20567 but expected 22105
+ ERROR:  invalid page in block 0 of relation base/16384/31175
  SELECT * FROM clstr_1 UNION ALL
    SELECT * FROM clstr_2 UNION ALL
    SELECT * FROM clstr_3;
   a 
  ---
   2
   1
   2
+  1
   2
   1
  (6 rows)
***************
*** 356,363 ****
    SELECT * FROM clstr_3;
   a 
  ---
-  1
   2
   2
   1
   2
--- 361,368 ----
    SELECT * FROM clstr_3;
   a 
  ---
   2
+  1
   2
   1
   2
***************
*** 369,379 ****
  INSERT INTO clstr_1 VALUES (2);
  INSERT INTO clstr_1 VALUES (1);
  CLUSTER clstr_1;
  SELECT * FROM clstr_1;
   a 
  ---
-  1
   2
  (2 rows)
  
  -- Test MVCC-safety of cluster. There isn't much we can do to verify the
--- 374,385 ----
  INSERT INTO clstr_1 VALUES (2);
  INSERT INTO clstr_1 VALUES (1);
  CLUSTER clstr_1;
+ ERROR:  there is no previously clustered index for table "clstr_1"
  SELECT * FROM clstr_1;
   a 
  ---
   2
+  1
  (2 rows)
  
  -- Test MVCC-safety of cluster. There isn't much we can do to verify the
***************
*** 405,429 ****
  (5 rows)
  
  CLUSTER clustertest_pkey ON clustertest;
  SELECT * FROM clustertest;
!  key 
! -----
!   20
!   30
!   35
!   80
!  100
! (5 rows)
! 
  COMMIT;
  SELECT * FROM clustertest;
   key 
  -----
    20
    30
!   35
!   80
!  100
  (5 rows)
  
  -- check that temp tables can be clustered
--- 411,429 ----
  (5 rows)
  
  CLUSTER clustertest_pkey ON clustertest;
+ WARNING:  page verification failed, calculated checksum 23239 but expected 48707
+ ERROR:  invalid page in block 0 of relation base/16384/31184
  SELECT * FROM clustertest;
! ERROR:  current transaction is aborted, commands ignored until end of transaction block
  COMMIT;
  SELECT * FROM clustertest;
   key 
  -----
+   10
    20
    30
!   40
!   50
  (5 rows)
  
  -- check that temp tables can be clustered

======================================================================

#24Jeff Davis
pgsql@j-davis.com
In reply to: Jaime Casanova (#23)
Re: corrupt pages detected by enabling checksums

On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:

On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

I still got errors when executing make installcheck in a just compiled
9.3devel + this_patch, this is when setting wal_level higher than
minimal.
Attached regression.diffs

Did you also apply the other patch here:

/messages/by-id/1364340203.21411.143.camel@sussancws0025

?

I think that is a completely separate issue. Simon, should that patch be
applied or are you waiting on the issue Jeff Janes raised to be
resolved, as well?

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Jaime Casanova
jaime@2ndquadrant.com
In reply to: Jeff Davis (#24)
Re: corrupt pages detected by enabling checksums

On Fri, Apr 5, 2013 at 7:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:

On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

I still got errors when executing make installcheck in a just compiled
9.3devel + this_patch, this is when setting wal_level higher than
minimal.
Attached regression.diffs

Did you also apply the other patch here:

/messages/by-id/1364340203.21411.143.camel@sussancws0025

i missed that one... with both applied, first problems disappear...
will try some more tests later

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

#26Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Davis (#21)
Re: corrupt pages detected by enabling checksums

On 2013-04-05 16:29:47 -0700, Jeff Davis wrote:

On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

Comments:

* Regarding full page writes, we can:
- always write full pages (as in your current patch), regardless of
the current settings
- take WALInsertLock briefly to get the current settings from XLogCtl
- always calculate the full page record, but in XLogInsert, if it
happens to be a HINT record, and full page writes are not necessary,
then discard it (right now I somewhat favor this option but I haven't
looked at the details)

I feel pretty strongly that we shouldn't add any such complications to
XLogInsert() itself; it's complicated enough already and it should be
made simpler, not more complicated.

I think we can just make up the rule that changing full page writes also
requires SpinLockAcquire(&xlogctl->info_lck);. Then it's easy enough. And
it can hardly be a performance bottleneck given how infrequently it's
modified.

In retrospect I think making up the rule that full_page_writes changes
imply a checkpoint would have made things easier, both performance- and
code-wise.

* typo in "Backup blocks are not used in **xlog xlog** records"

That's just me being "funny" after some long days ;). See, it's an 'xlog'
'xlog record'.

* To get the appropriate setting for buffer_std, we should pass it down
through MarkBufferDirty as an extra flag, I think.

Or just declare as part of the api that only std buffers are allowed to
be passed down. After a quick look it looks like all callers use enough
of the standard page layout to make that viable. But that really was
just a quick look.

* I'm a bit worried that we'll need a cleanup lock when restoring an FSM
page. What do you think?

I don't yet see why, while recovery is ongoing there shouldn't be anyone
doing anything concurrently to it but startup itself?

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
Merely a matter of preference but I thought I would mention it.

You're absolutely right, memcpy should have gotten passed 'data', not
XLogRecGetData().

Jeff, any chance you can run this for a round with your suite?

Yes. I don't have a rigorous test suite, but I'll do some manual tests
and walk through it with gdb.

Heh, in this and only this I was talking to Jeff Janes. Strangely I
never had noticed that you share the same name ;)

Thanks!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#27Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#20)
Re: corrupt pages detected by enabling checksums

On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com>wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

I've run it for a while now and have found no problems.

Thanks,

Jeff

#28Jaime Casanova
jaime@2ndquadrant.com
In reply to: Jeff Janes (#27)
Re: corrupt pages detected by enabling checksums

On Sat, Apr 6, 2013 at 1:36 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

I've run it for a while now and have found no problems.

fwiw, i have run installcheck (serial and parallel) and isolationtest,
also combination of those (one installcheck, one isolationtest) at the
same time while executing vacuum full, reindex database and manual
checkpoint...

i also check that the bgwriter was doing some work.

i did all of this in a master node in a cluster with a standby and a
cascade standby that were later promoted... and i have no problem with
checksums at all, so i would say that the combination of Jeff's and
Andres' patches fixed the problems we have seen until now

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#26)
Re: corrupt pages detected by enabling checksums

On 6 April 2013 15:44, Andres Freund <andres@2ndquadrant.com> wrote:

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
Merely a matter of preference but I thought I would mention it.

Youre absolutely right, memcpy should have gotten passed 'data', not
XLogRecGetData().

Applied, with this as the only code change.

Thanks everybody for good research and coding and fast testing.

We're in good shape now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#30Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#29)
2 attachment(s)
Re: corrupt pages detected by enabling checksums

On Mon, 2013-04-08 at 09:19 +0100, Simon Riggs wrote:

Applied, with this as the only code change.

Thanks everybody for good research and coding and fast testing.

We're in good shape now.

Thank you.

I have attached two more patches:

1. I believe that the issue I brought up at the end of this email:

/messages/by-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.
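
The unsafe ordering (a-d) and the fix can be modeled with a toy sketch.
This is illustrative C only, not PostgreSQL source; ToyBuffer,
log_heap_page, and the sequence functions are invented names. The point
it demonstrates: a checkpoint only writes buffers whose dirty flag is
set, so the heap page must be marked dirty before any WAL record stamps
its LSN.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy page buffer: just a dirty flag and a page LSN. */
typedef struct
{
	bool		dirty;	/* BM_DIRTY in the real buffer header */
	uint64_t	lsn;	/* page LSN stamped when the page is WAL-logged */
} ToyBuffer;

/*
 * Stand-in for visibilitymap_set() logging the heap page: it stamps the
 * LSN and reports whether the buffer was already dirty at that moment.
 * If it was still clean, a checkpoint running concurrently could have
 * skipped the buffer even though a WAL record now references it.
 */
static bool
log_heap_page(ToyBuffer *buf, uint64_t lsn)
{
	bool		was_dirty = buf->dirty;

	buf->lsn = lsn;
	return was_dirty;
}

/* The fixed ordering from the patch: dirty the heap buffer first. */
bool
fixed_sequence(void)
{
	ToyBuffer	buf = {false, 0};

	buf.dirty = true;				/* MarkBufferDirty(buffer) */
	return log_heap_page(&buf, 42);	/* visibilitymap_set() logs the page */
}

/* The broken ordering: the LSN is set while the buffer is still clean. */
bool
buggy_sequence(void)
{
	ToyBuffer	buf = {false, 0};
	bool		safe = log_heap_page(&buf, 42);

	buf.dirty = true;				/* MarkBufferDirty() came too late */
	return safe;
}
```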

2. A cleanup patch to pass the buffer_std flag down through
MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
so it might not be wanted. The reason I thought it was useful is that a
future caller that sets a hint on a non-standard buffer might easily
miss the assumption that we have a standard buffer.

Regards,
Jeff Davis

Attachments:

checksums-buffer-std.patch (text/x-patch)
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 287,293 **** hashgettuple(PG_FUNCTION_ARGS)
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf);
  		}
  
  		/*
--- 287,293 ----
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf, true);
  		}
  
  		/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 262,268 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer);
  		}
  	}
  
--- 262,268 ----
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer, true);
  		}
  	}
  
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 413,421 **** _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  					 * crucial. Be sure to mark the proper buffer dirty.
  					 */
  					if (nbuf != InvalidBuffer)
! 						MarkBufferDirtyHint(nbuf);
  					else
! 						MarkBufferDirtyHint(buf);
  				}
  			}
  		}
--- 413,421 ----
  					 * crucial. Be sure to mark the proper buffer dirty.
  					 */
  					if (nbuf != InvalidBuffer)
! 						MarkBufferDirtyHint(nbuf, true);
  					else
! 						MarkBufferDirtyHint(buf, true);
  				}
  			}
  		}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 1052,1058 **** restart:
  				opaque->btpo_cycleid == vstate->cycleid)
  			{
  				opaque->btpo_cycleid = 0;
! 				MarkBufferDirtyHint(buf);
  			}
  		}
  
--- 1052,1058 ----
  				opaque->btpo_cycleid == vstate->cycleid)
  			{
  				opaque->btpo_cycleid = 0;
! 				MarkBufferDirtyHint(buf, true);
  			}
  		}
  
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***************
*** 1789,1795 **** _bt_killitems(IndexScanDesc scan, bool haveLock)
  	if (killedsomething)
  	{
  		opaque->btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so->currPos.buf);
  	}
  
  	if (!haveLock)
--- 1789,1795 ----
  	if (killedsomething)
  	{
  		opaque->btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so->currPos.buf, true);
  	}
  
  	if (!haveLock)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 7661,7667 **** XLogRestorePoint(const char *rpName)
   * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
  {
  	XLogRecPtr recptr = InvalidXLogRecPtr;
  	XLogRecPtr lsn;
--- 7661,7667 ----
   * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
  {
  	XLogRecPtr recptr = InvalidXLogRecPtr;
  	XLogRecPtr lsn;
***************
*** 7683,7689 **** XLogSaveBufferForHint(Buffer buffer)
  	 * We reuse and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = true;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
--- 7683,7689 ----
  	 * We reuse and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = buffer_std;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***************
*** 1118,1124 **** read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
  		HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
  		seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
  		seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! 		MarkBufferDirtyHint(*buf);
  	}
  
  	seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
--- 1118,1124 ----
  		HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
  		seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
  		seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! 		MarkBufferDirtyHint(*buf, true);
  	}
  
  	seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 2582,2588 **** IncrBufferRefCount(Buffer buffer)
   *    (due to a race condition), so it cannot be used for important changes.
   */
  void
! MarkBufferDirtyHint(Buffer buffer)
  {
  	volatile BufferDesc *bufHdr;
  	Page	page = BufferGetPage(buffer);
--- 2582,2588 ----
   *    (due to a race condition), so it cannot be used for important changes.
   */
  void
! MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
  {
  	volatile BufferDesc *bufHdr;
  	Page	page = BufferGetPage(buffer);
***************
*** 2667,2673 **** MarkBufferDirtyHint(Buffer buffer)
  			 * rather than full transactionids.
  			 */
  			MyPgXact->delayChkpt = delayChkpt = true;
! 			lsn = XLogSaveBufferForHint(buffer);
  		}
  
  		LockBufHdr(bufHdr);
--- 2667,2673 ----
  			 * rather than full transactionids.
  			 */
  			MyPgXact->delayChkpt = delayChkpt = true;
! 			lsn = XLogSaveBufferForHint(buffer, buffer_std);
  		}
  
  		LockBufHdr(bufHdr);
*** a/src/backend/storage/freespace/freespace.c
--- b/src/backend/storage/freespace/freespace.c
***************
*** 216,222 **** XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
  		PageInit(page, BLCKSZ, 0);
  
  	if (fsm_set_avail(page, slot, new_cat))
! 		MarkBufferDirtyHint(buf);
  	UnlockReleaseBuffer(buf);
  }
  
--- 216,222 ----
  		PageInit(page, BLCKSZ, 0);
  
  	if (fsm_set_avail(page, slot, new_cat))
! 		MarkBufferDirtyHint(buf, true);
  	UnlockReleaseBuffer(buf);
  }
  
***************
*** 286,292 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
  			return;				/* nothing to do; the FSM was already smaller */
  		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
  		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! 		MarkBufferDirtyHint(buf);
  		UnlockReleaseBuffer(buf);
  
  		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
--- 286,292 ----
  			return;				/* nothing to do; the FSM was already smaller */
  		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
  		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! 		MarkBufferDirtyHint(buf, true);
  		UnlockReleaseBuffer(buf);
  
  		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
***************
*** 619,625 **** fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
  	page = BufferGetPage(buf);
  
  	if (fsm_set_avail(page, slot, newValue))
! 		MarkBufferDirtyHint(buf);
  
  	if (minValue != 0)
  	{
--- 619,625 ----
  	page = BufferGetPage(buf);
  
  	if (fsm_set_avail(page, slot, newValue))
! 		MarkBufferDirtyHint(buf, true);
  
  	if (minValue != 0)
  	{
***************
*** 770,776 **** fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
  			{
  				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
  				fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! 				MarkBufferDirtyHint(buf);
  				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			}
  		}
--- 770,776 ----
  			{
  				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
  				fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! 				MarkBufferDirtyHint(buf, true);
  				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  			}
  		}
*** a/src/backend/storage/freespace/fsmpage.c
--- b/src/backend/storage/freespace/fsmpage.c
***************
*** 284,290 **** restart:
  				exclusive_lock_held = true;
  			}
  			fsm_rebuild_page(page);
! 			MarkBufferDirtyHint(buf);
  			goto restart;
  		}
  	}
--- 284,290 ----
  				exclusive_lock_held = true;
  			}
  			fsm_rebuild_page(page);
! 			MarkBufferDirtyHint(buf, true);
  			goto restart;
  		}
  	}
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 121,127 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
  	}
  
  	tuple->t_infomask |= infomask;
! 	MarkBufferDirtyHint(buffer);
  }
  
  /*
--- 121,127 ----
  	}
  
  	tuple->t_infomask |= infomask;
! 	MarkBufferDirtyHint(buffer, true);
  }
  
  /*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 267,273 **** extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
  extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
  extern int	XLogFileOpen(XLogSegNo segno);
  
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer);
  
  extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);
--- 267,273 ----
  extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
  extern int	XLogFileOpen(XLogSegNo segno);
  
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
  
  extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 204,210 **** extern Size BufferShmemSize(void);
  extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
  			 ForkNumber *forknum, BlockNumber *blknum);
  
! extern void MarkBufferDirtyHint(Buffer buffer);
  
  extern void UnlockBuffers(void);
  extern void LockBuffer(Buffer buffer, int mode);
--- 204,210 ----
  extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
  			 ForkNumber *forknum, BlockNumber *blknum);
  
! extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
  
  extern void UnlockBuffers(void);
  extern void LockBuffer(Buffer buffer, int mode);
checksums-markdirty.patch (text/x-patch)
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 901,926 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible)
  		{
! 			if (!PageIsAllVisible(page))
! 			{
! 				PageSetAllVisible(page);
! 				MarkBufferDirty(buf);
! 				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
! 								  vmbuffer, visibility_cutoff_xid);
! 			}
! 			else if (!all_visible_according_to_vm)
! 			{
! 				/*
! 				 * It should never be the case that the visibility map page is
! 				 * set while the page-level bit is clear, but the reverse is
! 				 * allowed.  Set the visibility map bit as well so that we get
! 				 * back in sync.
! 				 */
! 				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
! 								  vmbuffer, visibility_cutoff_xid);
! 			}
  		}
  
  		/*
--- 901,925 ----
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible && !all_visible_according_to_vm)
  		{
! 			/*
! 			 * It should never be the case that the visibility map page is set
! 			 * while the page-level bit is clear, but the reverse is allowed
! 			 * (if checksums are not enabled).  Regardless, set the both bits
! 			 * so that we get back in sync.
! 			 *
! 			 * NB: If the heap page is all-visible but the VM bit is not set,
! 			 * we don't need to dirty the heap page.  However, if checksums are
! 			 * enabled, we do need to make sure that the heap page is dirtied
! 			 * before passing it to visibilitymap_set(), because it may be
! 			 * logged.  Given that this situation should only happen in rare
! 			 * cases after a crash, it is not worth optimizing.
! 			 */
! 			PageSetAllVisible(page);
! 			MarkBufferDirty(buf);
! 			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
! 							  vmbuffer, visibility_cutoff_xid);
  		}
  
  		/*
***************
*** 1146,1151 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
--- 1145,1158 ----
  	PageRepairFragmentation(page);
  
  	/*
+ 	 * If checksums are enabled, visibilitymap_set() may log the heap page, so
+ 	 * we mark heap buffer dirty before calling visibilitmap_set().  We need to
+ 	 * mark the heap page dirty regardless, but the order only matters when
+ 	 * checksums are enabled.
+ 	 */
+ 	MarkBufferDirty(buffer);
+ 
+ 	/*
  	 * Now that we have removed the dead tuples from the page, once again check
  	 * if the page has become all-visible.
  	 */
***************
*** 1158,1165 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  				visibility_cutoff_xid);
  	}
  
- 	MarkBufferDirty(buffer);
- 
  	/* XLOG stuff */
  	if (RelationNeedsWAL(onerel))
  	{
--- 1165,1170 ----
#31Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#26)
Re: corrupt pages detected by enabling checksums

On Sat, 2013-04-06 at 16:44 +0200, Andres Freund wrote:

I think we can just make up the rule that changing full page writes also
requires SpinLockAcquire(&xlogctl->info_lck);. Then its easy enough. And
it can hardly be a performance bottleneck given how infrequently its
modified.

That seems like a good idea to me. As it stands, checksums basically
force full page writes to be on; so we should either fix that or
document it.

In retrospect I think making up the rule that full_page_writes changes
imply a checkpoint would have made things easier performance and
codewise.

I don't even see why we allow changing that while the server is on.
Either the I/O system requires it for safety or not, right?

Regards,
Jeff Davis

#32Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#26)
Re: corrupt pages detected by enabling checksums

On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I feel pretty strongly that we shouldn't add any such complications to
XLogInsert() itself, its complicated enough already and it should be
made simpler, not more complicated.

+1, emphatically. XLogInsert is a really nasty scalability
bottleneck. We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

...Robert

#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#32)
Re: corrupt pages detected by enabling checksums

On 11 April 2013 00:37, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

I feel pretty strongly that we shouldn't add any such complications to
XLogInsert() itself, its complicated enough already and it should be
made simpler, not more complicated.

+1, emphatically. XLogInsert is a really nasty scalability
bottleneck. We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

Andres' patch was applied, so not sure what you mean by +1ing a comment
made in relation to that patch.

I'm aware that WALInsertLock is a bottleneck and am not going to be making
that worse.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#30)
Re: corrupt pages detected by enabling checksums

On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:

1. I believe that the issue I brought up at the end of this email:

/messages/by-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.

Applied

2. A cleanup patch to pass the buffer_std flag down through
MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
so it might not be wanted. The reason I thought it was useful is that a
future caller that sets a hint on a non-standard buffer might easily
miss the assumption that we have a standard buffer.

Skipped that for now. Do we really need it? Can you set a hint on a
non-standard buffer?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#35Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#34)
Re: corrupt pages detected by enabling checksums

On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:

1. I believe that the issue I brought up at the end of this email:

/messages/by-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.

Applied

Uh, wait a minute. I think this is completely wrong. The buffer is
LOCKED for this entire sequence of operations. For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted. The change to
lazy_vacuum_page is harmless, but the comment is wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#35)
Re: corrupt pages detected by enabling checksums

On 30 April 2013 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:

1. I believe that the issue I brought up at the end of this email:

/messages/by-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.

Applied

Uh, wait a minute. I think this is completely wrong.

Thanks for your review.

The buffer is
LOCKED for this entire sequence of operations. For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

I was following the logic we use for WAL: mark the buffer dirty, then write WAL.

The important thing is for us to signal that the buffer should be
written, which we do by marking it dirty. The actual writing of the
buffer comes later, often minutes later.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.

OK, agreed. Will wait for this discussion to end before acting though,
so I'm sure exactly what you mean.

The change to
lazy_vacuum_page is harmless, but the comment is wrong.

In what way, wrong? What should it say?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#35)
Re: corrupt pages detected by enabling checksums

On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:

Uh, wait a minute. I think this is completely wrong. The buffer is
LOCKED for this entire sequence of operations. For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

I went back and forth on this, so you could be right, but here was my
reasoning:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Regards,
Jeff Davis


#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#37)
Re: corrupt pages detected by enabling checksums

On 30 April 2013 22:54, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:

Uh, wait a minute. I think this is completely wrong. The buffer is
LOCKED for this entire sequence of operations. For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

I went back and forth on this, so you could be right, but here was my
reasoning:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

All of that makes perfect sense to me.

Waiting to hear back from Robert whether he still has an objection.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#39Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#37)
Re: corrupt pages detected by enabling checksums

On Tue, Apr 30, 2013 at 5:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:

Uh, wait a minute. I think this is completely wrong. The buffer is
LOCKED for this entire sequence of operations. For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

I went back and forth on this, so you could be right, but here was my
reasoning:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

On further review, I think you're right about this. We'd have a
problem if the following sequence of events were to occur:

1. vacuum writes the WAL record, but does not yet mark the buffer
dirty, and then goes into the tank
2. checkpointer decides where the checkpoint will begin
3. buffer pool is scanned as part of the checkpoint process, observing
target buffer not to be dirty
4. vacuum finally wakes up and marks the buffer dirty
5. crash, after WAL is flushed but before buffer is written

However, on looking at this a little more, shouldn't we be doing
log_heap_clean() *before* we do visibilitymap_set()?

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Well, I don't find it that ugly, but then again

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#40Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
Re: corrupt pages detected by enabling checksums

On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

On further review, I think you're right about this. We'd have a
problem if the following sequence of events were to occur:

1. vacuum writes the WAL record, but does not yet mark the buffer
dirty, and then goes into the tank
2. checkpointer decides where the checkpoint will begin
3. buffer pool is scanned as part of the checkpoint process, observing
target buffer not to be dirty
4. vacuum finally wakes up and marks the buffer dirty
5. crash, after WAL is flushed but before buffer is written

However, on looking at this a little more, shouldn't we be doing
log_heap_clean() *before* we do visibilitymap_set()?

Hit send too soon.

To finish that thought: right now the order of operations is...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the visibility map change.
4. Log the cleanup operations.

It seems to me that it would be better to do (4) as close to (1) as
possible. Isn't it bad if the operations are replayed in an order
that differs from the order in which they were performed - or if, say,
the first WAL record were replayed but the second were not, for
whatever reason?

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Well, I don't find it that ugly, but then again

...I've done a fair amount of hacking on this code. The scenario that
I find problematic here is that you could easily lose a couple of
visibility map pages in a crash. Each one you lose, with this patch,
potentially involves half a gigabyte of unnecessary disk writes, if
the relation is being vacuumed at the time. For the few extra lines
of code it takes to protect against that scenario, I think we should
protect against it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#40)
Re: corrupt pages detected by enabling checksums

On 1 May 2013 16:33, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

On further review, I think you're right about this. We'd have a
problem if the following sequence of events were to occur:

1. vacuum writes the WAL record, but does not yet mark the buffer
dirty, and then goes into the tank
2. checkpointer decides where the checkpoint will begin
3. buffer pool is scanned as part of the checkpoint process, observing
target buffer not to be dirty
4. vacuum finally wakes up and marks the buffer dirty
5. crash, after WAL is flushed but before buffer is written

However, on looking at this a little more, shouldn't we be doing
log_heap_clean() *before* we do visibilitymap_set()?

Hit send too soon.

To finish that thought: right now the order of operations is...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the visibility map change.
4. Log the cleanup operations.

It seems to me that it would be better to do (4) as close to (1) as
possible. Isn't it bad if the operations are replayed in an order
that differs from the order in which they were performed - or if, say,
the first WAL record were replayed but the second were not, for
whatever reason?

I agree, but that was in the original coding wasn't it?

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#42Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#41)
Re: corrupt pages detected by enabling checksums

On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree, but that was in the original coding wasn't it?

I believe the problem was introduced by this commit:

commit fdf9e21196a6f58c6021c967dc5776a16190f295
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Feb 13 17:46:23 2013 +0200

Update visibility map in the second phase of vacuum.

There's a high chance that a page becomes all-visible when the second phase
of vacuum removes all the dead tuples on it, so it makes sense to check for
that. Otherwise the visibility map won't get updated until the next vacuum.

Pavan Deolasee, reviewed by Jeff Janes.

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

I thought about that, too. It certainly seems like more than we want
to try to do for 9.3 at this point. The other complication is that
there's a lot of conditional logic here. We're definitely going to
emit a cleanup record. We're going to emit a record to make the page
all-visible only sometimes, because it might not be all-visible yet:
it could have tuples on it that are deleted but not yet dead. And
then there's additional logic to handle the checksum case. Plus, the
all-visible marking can happen in other code paths, too, specifically
in phase 1 of vacuum. So it might be possible to consolidate this,
but off-hand it looks messy to me out of proportion to the benefits.

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set(). When checksums are enabled, we set the page LSN,
which is described like this: "we need to protect the heap page from
being torn". But how does setting the page LSN do that? Don't we
need to mark the buffer dirty or something like that? Sorry if this
is a dumb question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#43Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#40)
1 attachment(s)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-05-01 at 11:33 -0400, Robert Haas wrote:

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Well, I don't find it that ugly, but then again

...I've done a fair amount of hacking on this code. The scenario that
I find problematic here is that you could easily lose a couple of
visibility map pages in a crash. Each one you lose, with this patch,
potentially involves half a gigabyte of unnecessary disk writes, if
the relation is being vacuumed at the time. For the few extra lines
of code it takes to protect against that scenario, I think we should
protect against it.

I'm still unclear on how we'd lose so many VM bits at once, because they
are logged. I see how we can lose one if the heap page makes it to disk
before the VM bit's WAL is flushed, but that seems to be bounded to me.

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

Regards,
Jeff Davis

Attachments:

fixup.patch (text/x-patch)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 5755,5762 **** log_heap_freeze(Relation reln, Buffer buffer,
   * corresponding visibility map block.	Both should have already been modified
   * and dirtied.
   *
!  * If checksums are enabled, we also add the heap_buffer to the chain to
!  * protect it from being torn.
   */
  XLogRecPtr
  log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
--- 5755,5762 ----
   * corresponding visibility map block.	Both should have already been modified
   * and dirtied.
   *
!  * If checksums are enabled and the heap buffer was changed (PD_ALL_VISIBLE
!  * set), we add it to the chain to protect it from being torn.
   */
  XLogRecPtr
  log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
***************
*** 5784,5790 **** log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
  	rdata[1].buffer_std = false;
  	rdata[1].next = NULL;
  
! 	if (DataChecksumsEnabled())
  	{
  		rdata[1].next = &(rdata[2]);
  
--- 5784,5790 ----
  	rdata[1].buffer_std = false;
  	rdata[1].next = NULL;
  
! 	if (DataChecksumsEnabled() && BufferIsValid(heap_buffer))
  	{
  		rdata[1].next = &(rdata[2]);
  
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 233,242 **** visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
   * marked all-visible; it is needed for Hot Standby, and can be
   * InvalidTransactionId if the page contains no tuples.
   *
!  * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
!  * this function. Except in recovery, caller should also pass the heap
!  * buffer. When checksums are enabled and we're not in recovery, we must add
!  * the heap buffer to the WAL chain to protect it from being torn.
   *
   * You must pass a buffer containing the correct map page to this function.
   * Call visibilitymap_pin first to pin the right one. This function doesn't do
--- 233,243 ----
   * marked all-visible; it is needed for Hot Standby, and can be
   * InvalidTransactionId if the page contains no tuples.
   *
!  * The caller should supply heapBuf if and only if it set PD_ALL_VISIBLE on the
!  * heap page, and we're not in recovery.  In that case, the caller should have
!  * already set PD_ALL_VISIBLE and marked the page dirty.  When checksums are
!  * enabled and heapBuf is valid, this function will add heapBuf to the WAL
!  * chain to protect it from being torn.
   *
   * You must pass a buffer containing the correct map page to this function.
   * Call visibilitymap_pin first to pin the right one. This function doesn't do
***************
*** 257,263 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
  #endif
  
  	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
- 	Assert(InRecovery || BufferIsValid(heapBuf));
  
  	/* Check that we have the right heap page pinned, if present */
  	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
--- 258,263 ----
***************
*** 290,296 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
  				 * If data checksums are enabled, we need to protect the heap
  				 * page from being torn.
  				 */
! 				if (DataChecksumsEnabled())
  				{
  					Page heapPage = BufferGetPage(heapBuf);
  
--- 290,296 ----
  				 * If data checksums are enabled, we need to protect the heap
  				 * page from being torn.
  				 */
! 				if (DataChecksumsEnabled() && BufferIsValid(heapBuf))
  				{
  					Page heapPage = BufferGetPage(heapBuf);
  
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 894,918 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible && !all_visible_according_to_vm)
  		{
! 			/*
! 			 * It should never be the case that the visibility map page is set
! 			 * while the page-level bit is clear, but the reverse is allowed
! 			 * (if checksums are not enabled).  Regardless, set the both bits
! 			 * so that we get back in sync.
! 			 *
! 			 * NB: If the heap page is all-visible but the VM bit is not set,
! 			 * we don't need to dirty the heap page.  However, if checksums are
! 			 * enabled, we do need to make sure that the heap page is dirtied
! 			 * before passing it to visibilitymap_set(), because it may be
! 			 * logged.  Given that this situation should only happen in rare
! 			 * cases after a crash, it is not worth optimizing.
! 			 */
! 			PageSetAllVisible(page);
! 			MarkBufferDirty(buf);
! 			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
! 							  vmbuffer, visibility_cutoff_xid);
  		}
  
  		/*
--- 894,920 ----
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible)
  		{
! 			if (!PageIsAllVisible(page))
! 			{
! 				PageSetAllVisible(page);
! 				MarkBufferDirty(buf);
! 				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
! 								  vmbuffer, visibility_cutoff_xid);
! 			}
! 			else if (!all_visible_according_to_vm)
! 			{
! 				/*
! 				 * It should never be the case that the visibility map page is
! 				 * set while the page-level bit is clear, but the reverse is
! 				 * allowed.  Set the visibility map bit as well so that we get
! 				 * back in sync.
! 				 */
! 				visibilitymap_set(onerel, blkno, InvalidBuffer,
! 								  InvalidXLogRecPtr, vmbuffer,
! 								  visibility_cutoff_xid);
! 			}
  		}
  
  		/*
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#42)
Re: corrupt pages detected by enabling checksums

On 1 May 2013 19:16, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree, but that was in the original coding wasn't it?

I believe the problem was introduced by this commit:

commit fdf9e21196a6f58c6021c967dc5776a16190f295
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Feb 13 17:46:23 2013 +0200

Update visibility map in the second phase of vacuum.

There's a high chance that a page becomes all-visible when the second phase
of vacuum removes all the dead tuples on it, so it makes sense to check for
that. Otherwise the visibility map won't get updated until the next vacuum.

Pavan Deolasee, reviewed by Jeff Janes.

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

I thought about that, too. It certainly seems like more than we want
to try to do for 9.3 at this point. The other complication is that
there's a lot of conditional logic here. We're definitely going to
emit a cleanup record. We're going to emit a record to make the page
all-visible only sometimes, because it might not be all-visible yet:
it could have tuples on it that are deleted but not yet dead. And
then there's additional logic to handle the checksum case. Plus, the
all-visible marking can happen in other code paths, too, specifically
in phase 1 of vacuum. So it might be possible to consolidate this,
but off-hand it looks messy to me out of proportion to the benefits.

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set(). When checksums are enabled, we set the page LSN,
which is described like this: "we need to protect the heap page from
being torn". But how does setting the page LSN do that?

It doesn't

Don't we
need to mark the buffer dirty or something like that?

We do.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#45Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#42)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-05-01 at 14:16 -0400, Robert Haas wrote:

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set(). When checksums are enabled, we set the page LSN,
which is described like this: "we need to protect the heap page from
being torn". But how does setting the page LSN do that? Don't we
need to mark the buffer dirty or something like that? Sorry if this
is a dumb question.

The caller is supposed to dirty the heap page when setting
PD_ALL_VISIBLE. I thought I said that explicitly in the comments
somewhere, but now I don't see it.

It is slightly awkward to put the comment about the page being torn just
before setting the LSN. Feel free to move/remove/reword if you can think
of something better.

Because setting PD_ALL_VISIBLE does not write WAL by itself, my design
was to have it piggyback on the VM WAL record if checksums are turned
on. It would be nice to just use MarkBufferDirtyHint, but that does not
guarantee that the page will actually be marked dirty (see comments);
and we really need it to be properly marked dirty (otherwise we risk the
VM bit making it and the heap page not).

I also explored the idea of passing more options to MarkBufferDirty so
that we could force it to do what we want in each case, but decided
against it:

/messages/by-id/1357798000.5162.38.camel@jdavis-laptop

Regards,
Jeff Davis


#46Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#44)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-05-01 at 20:06 +0100, Simon Riggs wrote:

Why aren't we writing just one WAL record for this action?

...

I thought about that, too. It certainly seems like more than we want
to try to do for 9.3 at this point. The other complication is that
there's a lot of conditional logic here.

...

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
changes? Right now, that generates no WAL for the heap page (unless
checksums are enabled, of course), which means no full page images for
the heap pages.

I think we have to special-case that somehow, and I believe that's the
conditional logic that Robert is referring to.

That being said, there may be a simpler way to achieve the same result,
and that may be what you have in mind.

Regards,
Jeff Davis


#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#46)
Re: corrupt pages detected by enabling checksums

On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
changes? Right now, that generates no WAL for the heap page (unless
checksums are enabled, of course), which means no full page images for
the heap pages.

The only place I see that code path is when we clear up a heap page
that is empty.

Is that what you meant?

Empty pages are pretty rare, so I guess I meant that we should just
get rid of the vismap update when we see that. Since we're adding the
block straight back into the fsm, it won't stay all visible for very
long.

Bottom line is: is there a problem there, or not?
If there's not, I'm happy. If there is, then do we have another plan
than making this into one WAL record, so it is atomic?
And presumably y'all want me to fix it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#48Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#47)
Re: corrupt pages detected by enabling checksums

On Fri, 2013-05-03 at 19:52 +0100, Simon Riggs wrote:

On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
changes? Right now, that generates no WAL for the heap page (unless
checksums are enabled, of course), which means no full page images for
the heap pages.

The only place I see that code path is when we clear up a heap page
that is empty.

Or if there are no other changes to make to the heap page. For example,
if you do a fresh data load, then set the hint bits, and then do a
VACUUM. The only changes needed during VACUUM are setting PD_ALL_VISIBLE
and the VM bit.

Right now, that does not cause a wal record to be written for the heap
page, so there are no full-page images for the heap page.

At this point, I don't think more changes are required. Robert seemed
concerned about dirtying extra pages after a crash, but I didn't
entirely understand his reasoning. I am inclined toward simplifying this
part of the code as you suggest, but maybe that's better left for 9.4
(which would give me a chance to reignite the discussion about whether
PD_ALL_VISIBLE is needed at all) unless there is an actual problem.

Regards,
Jeff Davis


#49Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#48)
Re: corrupt pages detected by enabling checksums

On 3 May 2013 21:53, Jeff Davis <pgsql@j-davis.com> wrote:

At this point, I don't think more changes are required.

After detailed further analysis, I agree, no further changes are required.

I think the code in that area needs considerable refactoring to
improve things. I've looked for an easy way to avoid calling
PageSetLSN() twice, but I don't see one, which is the thing I thought
was a problem. I don't much like the nested critical sections either.
But overall, we do follow the requirements for WAL laid out in the
README, in that we dirty the buffer first, insert WAL, then set LSN,
all within a critical section and with buffer locking. So I don't see
any place where this will break with the current coding.

In the future I would hope to see that code simplified so that we do
just one WAL record per block, rather than the 3 (or more?) records
that can get written now: freeze, clean and visibility.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#50Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#43)
Re: corrupt pages detected by enabling checksums

On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

This looks reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#51Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#50)
Re: corrupt pages detected by enabling checksums

On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:

On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

This looks reasonable to me.

Can you please explain the scenario that loses many VM bits at once
during a crash, and results in a bunch of already-all-visible heap pages
being dirtied for no reason?

Regards,
Jeff Davis


#52Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#51)
Re: corrupt pages detected by enabling checksums

On Mon, May 6, 2013 at 5:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:

On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

This looks reasonable to me.

Can you please explain the scenario that loses many VM bits at once
during a crash, and results in a bunch of already-all-visible heap pages
being dirtied for no reason?

Hmm. Rereading your last email, I see your point: since we now have
HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
been before. I'm still not convinced that simplifying that code is a
good idea, but maybe it doesn't really hurt us much in practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#53Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#52)
Re: corrupt pages detected by enabling checksums

On Tue, 2013-05-07 at 13:20 -0400, Robert Haas wrote:

Hmm. Rereading your last email, I see your point: since we now have
HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
been before. I'm still not convinced that simplifying that code is a
good idea, but maybe it doesn't really hurt us much in practice.

Given that there's not a big impact one way or another, I don't mind
whether this particular patch is committed or not. Whichever you think
is more understandable and safer at this late hour. Also, to be clear,
the fact that I posted a patch was not meant to advocate either way;
merely to present the options.

Not sure exactly what you mean about the code simplification, but I
agree that anything more substantial than this patch should be left for
9.4.

Regards,
Jeff Davis


#54Jim Nasby
jim@nasby.net
In reply to: Jeff Davis (#22)
Re: corrupt pages detected by enabling checksums

On 4/5/13 6:39 PM, Jeff Davis wrote:

On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:

Maybe we could scan forward to check whether a corrupted WAL record is
followed by one or more valid ones with sensible LSNs. If it is,
chances are high that we haven't actually hit the end of the WAL. In
that case, we could either log a warning, or (better, probably) abort
crash recovery.

+1.

Corruption of fields which we require to scan past the record would
cause false negatives, i.e. not trigger an error even though we did
abort recovery mid-way through. There's a risk of false positives too,
but they require quite specific orderings of writes and thus seem
rather unlikely. (AFAICS, the OS would have to write some parts of
record N followed by the whole of record N+1 and then crash to cause a
false positive).

Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL
data makes it to disk, which could cause a false positive and may be a
fairly common case.

Apologies if this is a stupid question, but is this mostly an issue due to torn pages? IOW, if we had a way to ensure we never see torn pages, would that mean an invalid CRC on a WAL page indicated there really was corruption on that page?

Maybe it's worth putting (yet more) thought into the torn page issue... :/
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#55Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#54)
Re: corrupt pages detected by enabling checksums

On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:

Apologies if this is a stupid question, but is this mostly an issue
due to torn pages? IOW, if we had a way to ensure we never see torn
pages, would that mean an invalid CRC on a WAL page indicated there
really was corruption on that page?

Maybe it's worth putting (yet more) thought into the torn page
issue... :/

Sort of. For data, a page is the logically-atomic unit that is expected
to be intact. For WAL, a record is the logically-atomic unit that is
expected to be intact.

So it might be better to say that the issue for the WAL is "torn
records". A record might be larger than a page (it can hold up to three
full-page images in one record), but is often much smaller.

We use a CRC to validate that the WAL record is fully intact. The
concern is that, if it fails the CRC check, we *assume* that it's
because it wasn't completely flushed yet (i.e. a "torn record"). Based
on that assumption, neither that record nor any later record contains
committed transactions, so we can safely consider that the end of the
WAL (as of the crash) and bring the system up.

The problem is that the assumption is not always true: a CRC failure
could also indicate real corruption of WAL records that have been
previously flushed successfully, and may contain committed transactions.
That can mean we bring the system up way too early, corrupting the
database.

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write
to disk that would be pretty bad. Ideas welcome.
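The end-of-WAL heuristic described above can be sketched in miniature. This is a toy record layout (a length + CRC-32 header), not PostgreSQL's actual WAL format: replay stops at the first CRC mismatch, which is exactly why genuine corruption of an already-flushed record is indistinguishable from a harmless torn tail:

```python
import binascii
import struct

HDR = struct.Struct("<II")  # hypothetical header: payload length, CRC-32

def make_record(payload: bytes) -> bytes:
    return HDR.pack(len(payload), binascii.crc32(payload)) + payload

def replay(wal: bytes):
    """Replay records until a CRC mismatch, treating it as end of WAL."""
    records, off = [], 0
    while off + HDR.size <= len(wal):
        length, crc = HDR.unpack_from(wal, off)
        payload = wal[off + HDR.size : off + HDR.size + length]
        if len(payload) < length or binascii.crc32(payload) != crc:
            break  # assume a torn record: stop replay here
        records.append(payload)
        off += HDR.size + length
    return records

wal = make_record(b"insert t1") + make_record(b"commit")
# simulate a crash that tore the last record mid-write
torn = wal[:-3] + b"\x00\x00\x00"
assert replay(wal) == [b"insert t1", b"commit"]
assert replay(torn) == [b"insert t1"]  # real corruption would look the same
```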

Regards,
Jeff Davis


#56Jim Nasby
jim@nasby.net
In reply to: Jeff Davis (#55)
Re: corrupt pages detected by enabling checksums

On 5/8/13 7:34 PM, Jeff Davis wrote:

On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:

Apologies if this is a stupid question, but is this mostly an issue
due to torn pages? IOW, if we had a way to ensure we never see torn
pages, would that mean an invalid CRC on a WAL page indicated there
really was corruption on that page?

Maybe it's worth putting (yet more) thought into the torn page
issue... :/

Sort of. For data, a page is the logically-atomic unit that is expected
to be intact. For WAL, a record is the logically-atomic unit that is
expected to be intact.

So it might be better to say that the issue for the WAL is "torn
records". A record might be larger than a page (it can hold up to three
full-page images in one record), but is often much smaller.

We use a CRC to validate that the WAL record is fully intact. The
concern is that, if it fails the CRC check, we *assume* that it's
because it wasn't completely flushed yet (i.e. a "torn record"). Based
on that assumption, neither that record nor any later record contains
committed transactions, so we can safely consider that the end of the
WAL (as of the crash) and bring the system up.

The problem is that the assumption is not always true: a CRC failure
could also indicate real corruption of WAL records that have been
previously flushed successfully, and may contain committed transactions.
That can mean we bring the system up way too early, corrupting the
database.

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write
to disk that would be pretty bad. Ideas welcome.

What about moving some critical data from the beginning of the WAL record to the end? That would make it easier to detect that we don't have a complete record. It wouldn't necessarily replace the CRC though, so maybe that's not good enough.

Actually, what if we actually *duplicated* some of the same WAL header info at the end of the record? Given a reasonable amount of data that would damn-near ensure that a torn record was detected, because the odds of having the exact same sequence of random bytes would be so low. Potentially even just duplicating the LSN would suffice.
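The duplicated-trailer idea could be sketched like this, using a hypothetical record layout (the header's LSN repeated after the payload; nothing like the real WAL format): a trailer that disagrees with the header suggests the tail of the record never reached disk, i.e. a torn write rather than corruption of previously flushed WAL:

```python
import struct

HDR = struct.Struct("<QI")     # hypothetical header: record LSN, payload length
TRAILER = struct.Struct("<Q")  # the same LSN, repeated after the payload

def make_record(lsn: int, payload: bytes) -> bytes:
    return HDR.pack(lsn, len(payload)) + payload + TRAILER.pack(lsn)

def looks_torn(rec: bytes) -> bool:
    """A trailer that disagrees with the header suggests the record's tail
    never reached disk (a torn write), rather than silent corruption."""
    lsn, length = HDR.unpack_from(rec, 0)
    (trailer_lsn,) = TRAILER.unpack_from(rec, HDR.size + length)
    return trailer_lsn != lsn

rec = make_record(0x1A2B3C, b"payload")
# crash before the trailer reached disk: the tail still holds recycled bytes
torn = rec[:-TRAILER.size] + b"\xde\xad\xbe\xef\xde\xad\xbe\xef"
assert not looks_torn(rec)
assert looks_torn(torn)
```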

On the separate write idea, if that could be controlled by a GUC I think it'd be worth doing. Anyone that needs to worry about this corner case probably has hardware that would support that.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Jim Nasby (#56)
Re: corrupt pages detected by enabling checksums

On 9 May 2013 20:28, Jim Nasby <jim@nasby.net> wrote:

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write
to disk that would be pretty bad. Ideas welcome.

Not so sure.

If the WAL record length is intact, and it probably is, then we can
test whether the next WAL record is valid also.

If the current WAL record is corrupt and the next WAL record is
corrupt, then we have a problem.

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue. But we need to keep
track of accumulated errors to avoid getting into a worse situation.
Obviously, we would need to treat the next WAL record with complete
scepticism, but I have seen cases where only a single WAL record was
corrupt.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#57)
Re: corrupt pages detected by enabling checksums

Simon Riggs <simon@2ndQuadrant.com> writes:

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue.

That seems like a seriously bad idea.

regards, tom lane


#59Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#58)
Re: corrupt pages detected by enabling checksums

On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue.

That seems like a seriously bad idea.

I agree. But if you knew that were true, is stopping a better idea?

Any corrupt WAL record needs to halt recovery. What I'm thinking is to
check whether it might be possible to continue, and if all looks good,
offer the user the option to continue.

"Continuing could be dangerous and might cause deeper corruption of
your database. We suggest you take a backup of the server before
continuing"

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#60Greg Stark
stark@mit.edu
In reply to: Simon Riggs (#59)
Re: corrupt pages detected by enabling checksums

On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue.

That seems like a seriously bad idea.

I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I think it's not possible to protect 100% against this without giving
up the checksum optimization which implies doing two fsyncs per commit
instead of 1.

However it is possible to reduce the window. Every time the
transaction log is synced, a different file can be updated with a
known minimum transaction log recovery point. Even if it's not synced
consistently on every transaction commit or wal sync it would serve as
a low water mark. Recovering to that point is not sufficient but is
necessary for a consistent recovery. That file could be synced lazily,
say, every 10s or something like that and would guarantee that any wal
corruption would be caught except for the last 10s of wal traffic for
example.

If you're only interested in database consistency and not lost commits
then that file could be synced on buffer xlog flushes (making a
painful case even more painful). Off the top of my head that would be
sufficient to guarantee that a corrupt xlog that would create an
inconsistent database would not be missed. I may be missing cases
involving checkpoints or the like though.

--
greg


#61Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#56)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:

What about moving some critical data from the beginning of the WAL
record to the end? That would make it easier to detect that we don't
have a complete record. It wouldn't necessarily replace the CRC
though, so maybe that's not good enough.

Actually, what if we actually *duplicated* some of the same WAL header
info at the end of the record? Given a reasonable amount of data that
would damn-near ensure that a torn record was detected, because the
odds of having the exact same sequence of random bytes would be so
low. Potentially even just duplicating the LSN would suffice.

I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.

On the separate write idea, if that could be controlled by a GUC I
think it'd be worth doing. Anyone that needs to worry about this
corner case probably has hardware that would support that.

It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.

Regards,
Jeff Davis


#62Jeff Davis
pgsql@j-davis.com
In reply to: Greg Stark (#60)
Re: corrupt pages detected by enabling checksums

On Thu, 2013-05-09 at 23:13 +0100, Greg Stark wrote:

However it is possible to reduce the window...

Sounds reasonable.

It's fairly limited though -- the window is already a checkpoint
(typically 5-30 minutes), and we'd bring that down an order of magnitude
(10s). I speculate that, if it got corrupted within 30 minutes, it
probably got corrupted at the time of being written (as happened in Jeff
Janes's case, due to a bug).

So, the question is: if the WAL is corrupted on write, does reducing the
window significantly increase the chances that the wal writer will hang
around long enough before a crash to flush this other file?

On the other hand, checkpoint hides any corrupt WAL records by not
replaying them, whereas your scheme would identify that there is a
problem.

I don't think this would have helped Jeff Janes's case because I think
the crashes were happening too quickly. But that is artificial, so it
may help in real cases.

I just had a thought: we don't necessarily need to flush the auxiliary
file each time; merely writing it to the kernel buffers would help a
lot. Maybe an extra write() of the auxiliary file during a WAL flush
isn't so bad; and combined with periodic fsync()s of the auxiliary file,
should offer a lot of coverage against problems.

Regards,
Jeff Davis


#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#60)
Re: corrupt pages detected by enabling checksums

On 9 May 2013 23:13, Greg Stark <stark@mit.edu> wrote:

On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue.

That seems like a seriously bad idea.

I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
"correct end-of-WAL".

Each WAL record contains a "prev" pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.

However it is possible to reduce the window. Every time the
transaction log is synced, a different file can be updated with a
known minimum transaction log recovery point. Even if it's not synced
consistently on every transaction commit or wal sync it would serve as
a low water mark. Recovering to that point is not sufficient but is
necessary for a consistent recovery. That file could be synced lazily,
say, every 10s or something like that and would guarantee that any wal
corruption would be caught except for the last 10s of wal traffic for
example.

I think it would be easy enough to have the WALwriter update the
minRecoveryPoint once per cycle, after it has flushed WAL.

Given the importance of pg_control and its small size, it seems like
it would be a good idea to take a backup copy of it every checkpoint
to make sure we have that data safe. And have pg_resetxlog keep a copy
also every time that is run.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#64Greg Stark
stark@mit.edu
In reply to: Simon Riggs (#63)
Re: corrupt pages detected by enabling checksums

On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
"correct end-of-WAL".

Each WAL record contains a "prev" pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.

Well then you're wrong. The OS makes no guarantee that blocks are
written out in order. When the system crashes any random subset of the
blocks written but not synced might have been written to disk and
others not. There could be megabytes of correct WAL written with just
one block in the middle of the first record not written. If no xlog
sync had occurred (or one was in progress but not completed) then
that's the correct end of WAL.

--
greg


#65Amit Kapila
amit.kapila@huawei.com
In reply to: Greg Stark (#64)
Re: corrupt pages detected by enabling checksums

On Friday, May 10, 2013 6:09 PM Greg Stark wrote:

On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com>
wrote:

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
"correct end-of-WAL".

Each WAL record contains a "prev" pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.

Well then you're wrong. The OS makes no guarantee that blocks are
written out in order. When the system crashes any random subset of the
blocks written but not synced might have been written to disk and
others not. There could be megabytes of correct WAL written with just
one block in the middle of the first record not written. If no xlog
sync had occurred (or one was in progress but not completed) then
that's the correct end of WAL.

In the case where one block is missing, how can it even reach the next record
to check the "prev" pointer?
I think it can be possible when one of the records is corrupt and the following
ones are okay, which I think is the case in which it can proceed with a warning
as suggested by Simon.

With Regards,
Amit Kapila.


#66Greg Stark
stark@mit.edu
In reply to: Jeff Janes (#1)
Re: corrupt pages detected by enabling checksums

On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

In the case where one block is missing, how can it even reach the next record
to check the "prev" pointer?
I think it can be possible when one of the records is corrupt and the following
ones are okay, which I think is the case in which it can proceed with a warning
as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.
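The torn-middle scenario can be demonstrated with a toy block device and record layout (hypothetical, far simpler than the real WAL): the record's header and the entire following record both survive the crash, yet the record itself fails its CRC:

```python
import binascii
import struct

BLOCK = 8                   # tiny "disk block" for the sketch
HDR = struct.Struct("<II")  # hypothetical header: payload length, CRC-32

def make_record(payload: bytes) -> bytes:
    return HDR.pack(len(payload), binascii.crc32(payload)) + payload

big = make_record(b"A" * 20)      # spans several blocks
nxt = make_record(b"next-record")
wal = bytearray(big + nxt)

# the OS flushed every block except one in the middle of the big record
lost = 1
wal[lost * BLOCK : (lost + 1) * BLOCK] = b"\x00" * BLOCK

# the big record now fails its CRC ...
length, crc = HDR.unpack_from(wal, 0)
assert binascii.crc32(bytes(wal[HDR.size : HDR.size + length])) != crc
# ... yet its intact length field lets us reach the next record, which
# checks out perfectly -- replaying it anyway would skip the torn record
off = HDR.size + length
length2, crc2 = HDR.unpack_from(wal, off)
payload2 = bytes(wal[off + HDR.size : off + HDR.size + length2])
assert binascii.crc32(payload2) == crc2
```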

If you replayed the following record but not this record you would
have an inconsistent database. The following record could be an insert
for a child table with a foreign key reference to a tuple that was
inserted by the skipped record for example. Resulting in a database
that is logically inconsistent.

Or it could be an index insert for that tuple, which would result in a
physically inconsistent database with index pointers that point to
incorrect tuples. Index scans would return tuples that didn't match
the index or would miss tuples that should be returned.

--
greg


#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#66)
Re: corrupt pages detected by enabling checksums

Greg Stark <stark@mit.edu> writes:

A single WAL record can be over 24kB.

<pedantic>
Actually, WAL records can run to megabytes. Consider for example a
commit record for a transaction that dropped thousands of tables ---
there'll be info about each such table in the commit record, to cue
replay to remove those files.
</pedantic>

If you replayed the following record but not this record you would
have an inconsistent database. ...
Or it could be an index insert for that tuple, which would result in a
physically inconsistent database with index pointers that point to
incorrect tuples. Index scans would return tuples that didn't match
the index or would miss tuples that should be returned.

Skipping actions such as index page splits would lead to even more fun.

Even in simple cases such as successive inserts and deletions in the
same heap page, failing to replay some of the actions is going to be
disastrous. The *best case* scenario for that is that WAL replay
PANICs when it notices that the action it's trying to replay is
inconsistent with the current state of the page, eg it's trying to
insert at a TID that already exists.

IMO we can't proceed past a broken WAL record. The actually useful
suggestion upthread was that we try to notice whether there seem
to be valid WAL records past the broken one, so that we could warn
the DBA that some commits might have been lost. I don't think we
can do much in the way of automatic data recovery, but we could give
the DBA a chance to do forensics rather than blindly starting up (and
promptly overwriting all the evidence).

regards, tom lane


#68Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#64)
Re: corrupt pages detected by enabling checksums

On 10 May 2013 13:39, Greg Stark <stark@mit.edu> wrote:

On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
"correct end-of-WAL".

Each WAL record contains a "prev" pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.

Well then you're wrong. The OS makes no guarantee that blocks are
written out in order. When the system crashes any random subset of the
blocks written but not synced might have been written to disk and
others not. There could be megabytes of correct WAL written with just
one block in the middle of the first record not written. If no xlog
sync had occurred (or one was in progress but not completed) then
that's the correct end of WAL.

I agree that the correct end of WAL is where the last sync occurred.
We don't write() WAL except with an immediate sync(), so the chances
of what you say happening are very low to impossible. The timing
window between the write and the sync is negligible and yet I/O would
need to occur in that window and also be out of order from the order
of the write, which is unlikely because an I/O elevator would either
not touch the order of writes at all, or would want to maintain
sequential order to avoid head movement, which is what we want. I
guess we should add here "...with disks, maybe not with SSDs".

In any case, what is more important is that your idea to make an
occasional write of the minRecoveryPoint and then use that to cross
check against current LSN would allow us to at least confirm that we
have a single corrupt record and report that situation accurately to
the user. That idea will cover 95+% of such problems anyway, since
what we care about is long sequences of WAL records, not just the last
few written at crash, which the above discussion focuses on.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#67)
Re: corrupt pages detected by enabling checksums

On 10 May 2013 18:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:

A single WAL record can be over 24kB.

<pedantic>
Actually, WAL records can run to megabytes. Consider for example a
commit record for a transaction that dropped thousands of tables ---
there'll be info about each such table in the commit record, to cue
replay to remove those files.
</pedantic>

If you replayed the following record but not this record you would
have an inconsistent database. ...
Or it could be an index insert for that tuple, which would result in a
physically inconsistent database with index pointers that point to
incorrect tuples. Index scans would return tuples that didn't match
the index or would miss tuples that should be returned.

Skipping actions such as index page splits would lead to even more fun.

Even in simple cases such as successive inserts and deletions in the
same heap page, failing to replay some of the actions is going to be
disastrous. The *best case* scenario for that is that WAL replay
PANICs when it notices that the action it's trying to replay is
inconsistent with the current state of the page, eg it's trying to
insert at a TID that already exists.

Agreed

IMO we can't proceed past a broken WAL record. The actually useful
suggestion upthread was that we try to notice whether there seem
to be valid WAL records past the broken one, so that we could warn
the DBA that some commits might have been lost. I don't think we
can do much in the way of automatic data recovery, but we could give
the DBA a chance to do forensics rather than blindly starting up (and
promptly overwriting all the evidence).

The usual answer is switchover to the standby, hoping that the WAL was
not corrupted before it got sent there also.

If all servers go down people will want a "carry on regardless" option
as well, since the fault can be investigated on a standby. Even with
all of the above caveats, lying on our backs with our feet in the air
'cos we lost a few blocks will not impress anyone. Doing that will
likely be a medium-high risk thing, but that will still be better than
the certainty of a down server. It would need to be a manually iniated
option.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#70Jeff Janes
jeff.janes@gmail.com
In reply to: Greg Stark (#66)
Re: corrupt pages detected by enabling checksums

On Fri, May 10, 2013 at 9:54 AM, Greg Stark <stark@mit.edu> wrote:

On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

In the case where one block is missing, how can it even reach the next record
to check the "prev" pointer?
I think it can be possible when one of the records is corrupt and the following
ones are okay, which I think is the case in which it can proceed with a warning
as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

If you replayed the following record but not this record you would
have an inconsistent database.

I don't think we would ever want to *skip* the record and play the next
one. But if it looks like the next record is valid, we might not want to
automatically open the database in a possibly inconsistent state and in the
process overwrite the only existing copy of those WAL records which would
be necessary to make it consistent. Instead, could we present the DBA with
an explicit choice: open the database, try to reconstruct the corrupted
record via forensic inspection so that it can be played through (I have no
idea how likely such an attempt is to succeed), or copy the database for
later inspection and then open it?

But based on your description, perhaps refusing to automatically restart
and forcing an explicit decision would happen a lot more often, during
normal crashes with no corruption, than I was thinking it would.

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it is perfectly clear that all the
following WAL belongs to the recycled file and not the current file. They
would also have to turn off any startup scripts in init.d, to make sure a
rebooting server doesn't do recovery automatically and destroy evidence
that way.

Cheers,

Jeff

#71Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#70)
Re: corrupt pages detected by enabling checksums

On Fri, May 10, 2013 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

But based on your description, perhaps refusing to automatically restart and
forcing an explicit decision would happen a lot more often, during normal
crashes with no corruption, than I was thinking it would.

I bet it would. But I think Greg Stark's idea of periodically
flushing the current minimum recovery point to disk has a lot to
recommend it. That would be massively more reliable than what we have
now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#68)
Re: corrupt pages detected by enabling checksums

On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:

We don't write() WAL except with an immediate sync(), so the chances
of what you say happening are very low to impossible.

Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
assume they can be different.

I agree that it sounds unlikely that blocks 100 and 102 would be
written, but not 101. But perhaps that's more likely in systems like ZFS
where the physical blocks might be in very different places.

Regards,
Jeff Davis

#73Greg Smith
greg@2ndQuadrant.com
In reply to: Simon Riggs (#68)
Re: corrupt pages detected by enabling checksums

On 5/10/13 1:32 PM, Simon Riggs wrote:

The timing
window between the write and the sync is negligible and yet I/O would
need to occur in that window and also be out of order from the order
of the write, which is unlikely because an I/O elevator would either
not touch the order of writes at all, or would want to maintain
sequential order to avoid head movement, which is what we want. I
guess we should add here "...with disks, maybe not with SSDs".

It's not really safe to make any assumptions about I/O elevators.
Reordering gets done from the perspective of the last item written.
When the previous write was at the logical end of the disk, it can just
as easily re-order a queue of writes in the complete reverse order they
were issued in.

The only way you can ever get a useful guarantee is when an fsync
returns completion. Writes can effectively go out in a completely
random order until that point. All you can rely on is throwing up a
stop sign that says "tell me when all of them are done". In between
those, you have no idea of the ordering.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#74Amit kapila
amit.kapila@huawei.com
In reply to: Greg Stark (#66)
Re: corrupt pages detected by enabling checksums

On Friday, May 10, 2013 10:24 PM Greg Stark wrote:
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

In the case where one block is missing, how can it even reach the next
record to check the "prev" pointer. I think it can be possible when one
of the records is corrupt and the following ones are okay, which I think
is the case in which it can proceed with a warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

The only point I wanted to make was that there can be only "one such
record"; its length can be large or small.

If you replayed the following record but not this record you would
have an inconsistent database. The following record could be an insert
for a child table with a foreign key reference to a tuple that was
inserted by the skipped record for example. Resulting in a database
that is logically inconsistent.

The corrupt record might be one that leads to inconsistency in the
database, or it could be the commit record of a transaction that
performed only a select operation. It will be difficult or impossible to
tell which during recovery, but informing the DBA/user on such an
occasion can be useful, along with an optional way to continue (although
I am not sure how the DBA could decide in such a case; some other
information may be needed as well). The reason DBA/user intervention can
be useful here is that, if the data after one corrupt record is ignored,
it can still take time for the DBA/user to work out which operations
need to be performed again.

With Regards,
Amit Kapila.

#75Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#72)
Re: corrupt pages detected by enabling checksums

On 10 May 2013 23:41, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:

We don't write() WAL except with an immediate sync(), so the chances
of what you say happening are very low to impossible.

Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
assume they can be different.

I'm answering this just to complete the discussion.

Yes, the write and flush pointer can be different. The write/flush are
two actions; we do one first, then the other, very quickly, inside
XLogWrite().

But we cannot rely on that, since there are some times when we don't
do that, such as wal_buffer overflow and when background walwriter
writes are at the end of the ring buffer. Not common, but they do
exist and when they exist they write many blocks. Which is counter to
what I had said earlier.

This part of the proposal isn't necessary for us to get a good answer
95% of the time, so it is dropped.

As mentioned on other post, we can write
UpdateMinRecoveryPoint(LogwrtResult.Flush, false) every time we do
XLogBackgroundFlush() and some other rework to make that happen
correctly. I'll post a separate patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#76Jim Nasby
jim@nasby.net
In reply to: Jeff Davis (#61)
Re: corrupt pages detected by enabling checksums

On 5/9/13 5:18 PM, Jeff Davis wrote:

On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:

What about moving some critical data from the beginning of the WAL
record to the end? That would make it easier to detect that we don't
have a complete record. It wouldn't necessarily replace the CRC
though, so maybe that's not good enough.

Actually, what if we actually *duplicated* some of the same WAL header
info at the end of the record? Given a reasonable amount of data that
would damn-near ensure that a torn record was detected, because the
odds of having the exact same sequence of random bytes would be so
low. Potentially even just duplicating the LSN would suffice.

I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.

Yeah, I don't think a perfect solution is possible, short of attempting to tie directly into the filesystem (ie: on a journaling FS have some way to essentially treat the FS journal as WAL).

One additional step we might be able to take would be to scan forward looking for a record that would tell us when an fsync must have occurred (heck, maybe we should add an fsync WAL record...). If we find a corrupt WAL record followed by an fsync we know that we've now lost data. That closes some of the holes. Actually, that might handle all the holes...

On the separate write idea, if that could be controlled by a GUC I
think it'd be worth doing. Anyone that needs to worry about this
corner case probably has hardware that would support that.

It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.

I wonder if there are DW environments that are too large to keep a SR copy but would be able to afford the double-write overhead.

BTW, isn't performance what killed the double-buffer idea?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#77Jim Nasby
jim@nasby.net
In reply to: Jeff Janes (#70)
Re: corrupt pages detected by enabling checksums

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it perfectly clear that all the following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a performance hit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead of creating a new file would always fall to bgWriter and therefore not be a big deal.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#78Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jim Nasby (#77)
Re: corrupt pages detected by enabling checksums

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal compared
to simply creating a new one and removing the old one when it's no
longer required. Using fallocate (or posix_fallocate) (I have a patch
for that!) to create a new one is, by my tests, 28 times faster than
the currently-used method.

--
Jon

#79ktm@rice.edu
ktm@rice.edu
In reply to: Jim Nasby (#77)
Re: corrupt pages detected by enabling checksums

On Sun, May 12, 2013 at 03:46:00PM -0500, Jim Nasby wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it perfectly clear that all the following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a performance hit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead creating a new file would always fall to bgWriter and therefore not be a big deal.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

Unless something has changed dramatically, creating new files is a LOT more
overhead than reusing existing files. My two cents.

Regards,
Ken

#80ktm@rice.edu
ktm@rice.edu
In reply to: Jon Nelson (#78)
Re: corrupt pages detected by enabling checksums

On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.

--
Jon

What about for less cutting (bleeding) edge filesystems?

Ken

#81Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: ktm@rice.edu (#80)
Re: corrupt pages detected by enabling checksums

On Mon, May 13, 2013 at 7:49 AM, ktm@rice.edu <ktm@rice.edu> wrote:

On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.

What about for less cutting (bleeding) edge filesystems?

The patch would be applicable for any filesystem which implements the
fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4
would both work, if I recall properly. I'm certain there are others,
and the technique would probably work on other operating systems like
the *BSDs, etc.. Additionally, it's improbable that there would be a
performance hit for other filesystems versus simply writing zeroes,
since that's the approach that is taken anyway as a fallback. Another
win is reduction in fragmentation. I would love to be able to turn off
WAL recycling to perform more useful testing.

--
Jon

#82Andres Freund
andres@2ndquadrant.com
In reply to: Jon Nelson (#78)
Re: corrupt pages detected by enabling checksums

On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.

I don't think the comparison between just fallocate()ing and what we
currently do is fair. fallocate() doesn't guarantee that the file is the
same size after a crash, so you would still need an fsync() or we
couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
that big anymore in that case?
That said, using posix_fallocate seems like a good idea in lots of
places inside pg; it's just not all that easy to do in some of them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#83Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#82)
Re: corrupt pages detected by enabling checksums

On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.

I don't think the comparison between just fallocate()ing and what we
currently do is fair. fallocate() doesn't guarantee that the file is the
same size after a crash, so you would still need an fsync() or we
couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.

That said, using posix_fallocate seems like a good idea in lots of
places inside pg, its just not all that easy to do in some of the
places.

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?

--
Jon

#84Andres Freund
andres@2ndquadrant.com
In reply to: Jon Nelson (#83)
Re: corrupt pages detected by enabling checksums

On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:

On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:

On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it perfectly clear that all the
following WAL belongs to the recycled file and not the current file.

Perhaps we should also allow for zeroing out WAL files before reuse (or just
disable reuse). I know there's a performance hit there, but the reuse idea
happened before we had bgWriter. Theoretically the overhead creating a new
file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.

I don't think the comparison between just fallocate()ing and what we
currently do is fair. fallocate() doesn't guarantee that the file is the
same size after a crash, so you would still need an fsync() or we
couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.

Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#85Greg Stark
stark@mit.edu
In reply to: Andres Freund (#84)
Re: corrupt pages detected by enabling checksums

On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

In ext4, AIUI, it doesn't actually pick target blocks. It just adjusts
the accounting so it knows that many blocks will be needed for this
file and guarantees they'll be available. If you read from them it
knows to provide 0s. So in theory the performance in the critical path
would be worse but I think by an insignificant amount.

The reason Postgres pre-allocates the blocks is not for the
performance optimization. It's for safety. To guarantee -- as best as
possible -- that it won't get a write error when the time comes to
write to it. Especially to guarantee that the disk won't suddenly turn
out to be full.

It seems possible that some filesystems would not protect you against
media errors nearly as well when using it. Responding to a media error
might take time, and in a poorly written filesystem the error might even
be reported to the application when there's no need. But media errors
can occur at any time, even after the initial write, so I don't think
this should be a blocker. I think posix_fallocate is good
enough for us and I would support using it.

--
greg

#86Andres Freund
andres@2ndquadrant.com
In reply to: Greg Stark (#85)
Re: corrupt pages detected by enabling checksums

On 2013-05-13 16:03:11 +0100, Greg Stark wrote:

On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

In ext4, AIUI, it doesn't actually pick target blocks. It just adjusts
the accounting so it knows that many blocks will be needed for this
file and guarantees they'll be available. If you read from them it
knows to provide 0s. So in theory the performance in the critical path
would be worse but I think by an insignificant amount.

The reason Postgres pre-allocates the blocks is not for the
performance optimization. It's for safety. To guarantee -- as best as
possible -- that it won't get a write error when the time comes to
write to it. Especially to guarantee that the disk won't suddenly turn
out to be full.

posix_fallocate() guarantees that. And if you fsync() the file
afterwards, it's even supposed to still have enough space after a crash.

"After a successful call to posix_fallocate(), subsequent writes to
bytes in the specified range are guaranteed not to fail because of lack
of disk space."

It seems possible that some file systems would not protect you against
media errors nearly as well using it.

True. The same probably is true for modifications of existing files for
those fancy COW filesystems...

I think posix_fallocate is good enough for us and I would support
using it.

Me too, although this isn't the place where I'd be really excited to see
a patch implementing it properly ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#87Simon Riggs
simon@2ndQuadrant.com
In reply to: Jon Nelson (#83)
Re: corrupt pages detected by enabling checksums

On 13 May 2013 14:45, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?

Sounds like a good idea. Always nice to see a fresh take on earlier ideas.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
