Production block comparison facility
The block comparison facility presented earlier by Heikki could not be
used in production systems. ISTM that it would be desirable
to have something that could be used in that way.
ISTM it would be easy to make these changes:
* optionally generate a FPW for every WAL record, not just first
change after checkpoint
full_page_writes = 'always'
* when an FPW arrives, optionally run a check to see if it compares
correctly against the page already there, when running streaming
replication without a recovery target. We could skip reporting any
problems until the database is consistent
full_page_write_check = on
The above changes seem easy to implement.
With FPW compression, this would be a usable feature in production.
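Put together, the two proposed settings would look like this in postgresql.conf (hypothetical names from this proposal; full_page_write_check does not exist in any released PostgreSQL):

```
full_page_writes = 'always'     # write a FPW with every WAL record,
                                # not just the first after a checkpoint
full_page_write_check = on      # standby compares each incoming FPW
                                # against its local copy of the page
```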
Comments?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jul 20, 2014 at 5:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
The block comparison facility presented earlier by Heikki would not be
able to be used in production systems. ISTM that it would be desirable
to have something that could be used in that way.
ISTM easy to make these changes
* optionally generate a FPW for every WAL record, not just first
change after checkpoint
full_page_writes = 'always'
* when an FPW arrives, optionally run a check to see if it compares
correctly against the page already there, when running streaming
replication without a recovery target. We could skip reporting any
problems until the database is consistent
full_page_write_check = on
The above changes seem easy to implement.
With FPW compression, this would be a usable feature in production.
Comments?
This is an interesting idea, and it would be easier to use than what
has been submitted for CF1. However, full_page_writes set to "always"
would generate a large amount of WAL even for small records,
increasing I/O for the partition holding pg_xlog, and the frequency of
checkpoints run on the system. Is this really something suitable for
production?
Then, looking at the code, we would need to tweak XLogInsert for the
WAL record construction to always do a FPW and to update
XLogCheckBufferNeedsBackup. Then for the redo part, we would need to
do some extra operations in the area of
RestoreBackupBlock/RestoreBackupBlockContents, including masking
operations before comparing the content of the FPW and the current
page.
Does that sound right?
--
Michael
On 22 July 2014 08:49, Michael Paquier <michael.paquier@gmail.com> wrote:
This is an interesting idea, and it would be easier to use than what
has been submitted for CF1. However, full_page_writes set to "always"
would generate a large amount of WAL even for small records,
increasing I/O for the partition holding pg_xlog, and the frequency of
checkpoints run on the system. Is this really something suitable for
production?
For critical systems, yes, I think it is.
It would be possible to make that user selectable for particular
transactions or tables.
Then, looking at the code, we would need to tweak XLogInsert for the
WAL record construction to always do a FPW and to update
XLogCheckBufferNeedsBackup. Then for the redo part, we would need to
do some extra operations in the area of
RestoreBackupBlock/RestoreBackupBlockContents, including masking
operations before comparing the content of the FPW and the current
page.
Does that sound right?
Yes, it doesn't look like very much code, because it fits well with
existing approaches.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
If you're always going FPW then there's no point in the rest of the record.
The point here was to find problems so that users could run normally with
confidence.
The cases you might want to run in the mode you describe are the build farm
or integration testing. When testing your application on the next release
of postgres it would be nice to have tests for the replication in your
workload given the experience in 9.3.
Even without the constant full page writes, a live production system could
do a FPW comparison after a FPW if it was in a consistent state. That would
give standbys periodic verification at low cost.
--
greg
On 22 July 2014 12:54, Greg Stark <stark@mit.edu> wrote:
If you're always going FPW then there's no point in the rest of the record.
I think it's a simple matter to mark them XLP_BKP_REMOVABLE and to skip
any optimization of remainder of WAL records.
The point here was to find problems so that users could run normally with
confidence.
Yes, but a full overwrite mode would provide an even safer mode of operation.
The cases you might want to run in the mode you describe are the build farm
or integration testing. When testing your application on the next release
of postgres it would be nice to have tests for the replication in your
workload given the experience in 9.3.
Even without the constant full page writes a live production system could do
a FPW comparison after a FPW if it was in a consistent state. That would
give standbys periodic verification at low costs.
Yes, the two options I proposed are somewhat independent of each other.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I have spent some time digging more into this idea and finished with the
attached patch, which does the following: it adds a consistency check when
a FPW is restored and applied on a given page.
The consistency check is made of two phases:
- Apply a mask on the FPW and the current page to eliminate potential
conflicts like hint bits for example.
- Check that the FPW is consistent with the current page, i.e. that the
current page does not contain any new information that the FPW does not
have. This is done by checking the masked portions of the FPW and the
current page.
Also some more details:
- If an inconsistency is found, a WARNING is simply logged.
- The consistency check is done if current page is not empty, and if
database has reached a consistent state.
- The page masking API is taken from the WAL replay patch that was
submitted in CF1 and plugged in as an independent set of API.
- In the masking code, to make it possible to detect whether a page is used
by a sequence relation, SEQ_MAGIC as well as its opaque data structure are
renamed and moved into sequence.h.
- To facilitate debugging and comparison, the masked FPW and current page
are also converted into hex.
Things could be refactored and improved for sure, but this patch is already
useful as-is so I am going to add it to the next commit fest.
Comments are welcome.
Regards,
--
Michael
Attachments:
0001-Add-facility-to-check-FPW-consistency-at-WAL-replay.patch (text/plain)
From 5a05a3751ae278ba8eb7b79f50a4f7b652a1179c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 23 Jul 2014 23:11:10 +0900
Subject: [PATCH] Add facility to check FPW consistency at WAL replay
The consistency check on FPW is made of two phases each time a FPW is
restored in place of an existing page:
- Apply a mask on the FPW and the current page to eliminate potential
conflicts like hint bits for example.
- Check that the FPW is consistent with the current page, i.e. that the
current page does not contain any new information that the FPW does not
have.
This is done by checking the masked portions of the FPW and the current
page.
- If an inconsistency is found, a WARNING is simply logged.
This check is done only if the current page is not empty and if the
database has reached a consistent state.
---
src/backend/access/transam/xlog.c | 104 ++++++++++++
src/backend/commands/sequence.c | 34 ++--
src/backend/storage/buffer/Makefile | 2 +-
src/backend/storage/buffer/bufmask.c | 301 +++++++++++++++++++++++++++++++++++
src/include/commands/sequence.h | 13 ++
src/include/storage/bufmask.h | 21 +++
6 files changed, 452 insertions(+), 23 deletions(-)
create mode 100644 src/backend/storage/buffer/bufmask.c
create mode 100644 src/include/storage/bufmask.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e4f9595..a383126 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -47,6 +47,7 @@
#include "replication/walsender.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
+#include "storage/bufmask.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/large_object.h"
@@ -4065,6 +4066,109 @@ RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
page = (Page) BufferGetPage(buffer);
+ /*
+ * Before saving the new contents of the backup block ensure that
+ * it is consistent with the existing one. Apply masking to it and
+ * then perform a comparison with what is going to be added. Do
+ * only this operation once a consistent state has been reached by
+ * server and only if the page that is being rewritten is not empty
+ * to have a clean base for comparison.
+ */
+ if (reachedConsistency &&
+ !PageIsEmpty(page))
+ {
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blkno;
+ char *norm_new_page, *norm_old_page;
+ char *blk_data = (char *) palloc(BLCKSZ);
+ char old_buf[BLCKSZ * 2];
+ char new_buf[BLCKSZ * 2];
+ int j = 0;
+ int i;
+ bool inconsistent = false;
+
+ /*
+ * Block data needs to be reformatted correctly, especially if
+ * it has a hole. This is the same processing as below, but
+ * replicating this code saves memcpy calls if consistency
+ * checks cannot be done on this backup block.
+ */
+ if (bkpb.hole_length == 0)
+ {
+ memcpy(blk_data, blk, BLCKSZ);
+ }
+ else
+ {
+ memcpy(blk_data, blk, bkpb.hole_offset);
+ /* must zero-fill the hole */
+ MemSet(blk_data + bkpb.hole_offset, 0, bkpb.hole_length);
+ memcpy(blk_data + (bkpb.hole_offset + bkpb.hole_length),
+ blk + bkpb.hole_offset,
+ BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+ }
+
+ /* Mask pages */
+ BufferGetTag(buffer, &rnode, &forknum, &blkno);
+ norm_new_page = mask_page(blk_data, blkno);
+ norm_old_page = mask_page((char *) page, blkno);
+
+ /*
+ * Convert the pages to be compared into hex format to facilitate
+ * their comparison and make potential diffs more readable while
+ * debugging.
+ */
+ for (i = 0; i < BLCKSZ; i++)
+ {
+ const char *digits = "0123456789ABCDEF";
+ uint8 byte_new = (uint8) norm_new_page[i];
+ uint8 byte_old = (uint8) norm_old_page[i];
+
+ new_buf[j] = digits[byte_new >> 4];
+ old_buf[j] = digits[byte_old >> 4];
+ /*
+ * Do an inclusive comparison: if the new buffer has a mask
+ * marker where the old buffer has not, the pages are
+ * inconsistent, as this would mean that the old page has
+ * content that the new buffer has not.
+ */
+ if (new_buf[j] == 'F' && old_buf[j] != 'F')
+ {
+ inconsistent = true;
+ break;
+ }
+ j++;
+ new_buf[j] = digits[byte_new & 0x0F];
+ old_buf[j] = digits[byte_old & 0x0F];
+ if (new_buf[j] == 'F' && old_buf[j] != 'F')
+ {
+ inconsistent = true;
+ break;
+ }
+ j++;
+ }
+
+ /* Time to compare the old and new contents */
+ if (inconsistent)
+ elog(WARNING,
+ "Inconsistent pages found for record %X/%X, rel %u/%u/%u, "
+ "forknum %u, blkno %u",
+ (uint32) (lsn >> 32), (uint32) lsn,
+ rnode.spcNode, rnode.dbNode, rnode.relNode,
+ forknum, blkno);
+ else
+ elog(DEBUG1,
+ "Consistent pages found for record %X/%X, rel %u/%u/%u, "
+ "forknum %u, blkno %u",
+ (uint32) (lsn >> 32), (uint32) lsn,
+ rnode.spcNode, rnode.dbNode, rnode.relNode,
+ forknum, blkno);
+
+ pfree(blk_data);
+ pfree(norm_new_page);
+ pfree(norm_old_page);
+ }
+
if (bkpb.hole_length == 0)
{
memcpy((char *) page, blk, BLCKSZ);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..802aac7 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -46,16 +46,6 @@
#define SEQ_LOG_VALS 32
/*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC 0x1717
-
-typedef struct sequence_magic
-{
- uint32 magic;
-} sequence_magic;
-
-/*
* We store a SeqTable item for every sequence we have touched in the current
* session. This is needed to hold onto nextval/currval state. (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
@@ -306,7 +296,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
{
Buffer buf;
Page page;
- sequence_magic *sm;
+ SequencePageOpaqueData *sm;
OffsetNumber offnum;
/* Initialize first page of relation with special magic number */
@@ -316,9 +306,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
page = BufferGetPage(buf);
- PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
- sm = (sequence_magic *) PageGetSpecialPointer(page);
- sm->magic = SEQ_MAGIC;
+ PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+ sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+ sm->seq_page_id = SEQ_MAGIC;
/* Now insert sequence tuple */
@@ -1066,18 +1056,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
{
Page page;
ItemId lp;
- sequence_magic *sm;
+ SequencePageOpaqueData *sm;
Form_pg_sequence seq;
*buf = ReadBuffer(rel, 0);
LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(*buf);
- sm = (sequence_magic *) PageGetSpecialPointer(page);
+ sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
- if (sm->magic != SEQ_MAGIC)
+ if (sm->seq_page_id != SEQ_MAGIC)
elog(ERROR, "bad magic number in sequence \"%s\": %08X",
- RelationGetRelationName(rel), sm->magic);
+ RelationGetRelationName(rel), sm->seq_page_id);
lp = PageGetItemId(page, FirstOffsetNumber);
Assert(ItemIdIsNormal(lp));
@@ -1541,7 +1531,7 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
char *item;
Size itemsz;
xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
- sequence_magic *sm;
+ SequencePageOpaqueData *sm;
/* Backup blocks are not used in seq records */
Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
@@ -1564,9 +1554,9 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
*/
localpage = (Page) palloc(BufferGetPageSize(buffer));
- PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
- sm = (sequence_magic *) PageGetSpecialPointer(localpage);
- sm->magic = SEQ_MAGIC;
+ PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+ sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+ sm->seq_page_id = SEQ_MAGIC;
item = (char *) xlrec + sizeof(xl_seq_rec);
itemsz = record->xl_len - sizeof(xl_seq_rec);
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..7e0077c
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ * Routines for buffer masking, used to ensure that buffers used for
+ * comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ *
+ * IDENTIFICATION
+ * src/backend/storage/page/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufmask.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER 0xFF
+
+static void mask_unused_space(Page page);
+static void mask_heap_page(Page page);
+static void mask_spgist_page(Page page);
+static void mask_gist_page(Page page);
+static void mask_gin_page(Page page, BlockNumber blkno);
+static void mask_sequence_page(Page page);
+static void mask_btree_page(Page page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(Page page)
+{
+ int pd_lower = ((PageHeader) page)->pd_lower;
+ int pd_upper = ((PageHeader) page)->pd_upper;
+ int pd_special = ((PageHeader) page)->pd_special;
+
+ /* Sanity check */
+ if (pd_lower > pd_upper || pd_special < pd_upper ||
+ pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+ {
+ elog(ERROR, "invalid page at %X/%08X\n",
+ ((PageHeader) page)->pd_lsn.xlogid,
+ ((PageHeader) page)->pd_lsn.xrecoff);
+ }
+
+ memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(Page page)
+{
+ OffsetNumber off;
+ PageHeader phdr = (PageHeader) page;
+
+ mask_unused_space(page);
+
+ /* Ignore prune_xid (it's like a hint-bit) */
+ phdr->pd_prune_xid = 0xFFFFFFFF;
+
+ /* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+ phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+ /*
+ * Also mask the all-visible flag.
+ *
+ * XXX: It is unfortunate that we have to do this. If the flag is set
+ * incorrectly, that's serious, and we would like to catch it. If the flag
+ * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+ * records don't currently set the flag, even though it is set in the
+ * master, so we must silence failures that that causes.
+ */
+ phdr->pd_flags |= PD_ALL_VISIBLE;
+
+ for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ char *page_item;
+
+ page_item = (char *) (page + ItemIdGetOffset(iid));
+
+ /*
+ * Ignore hint bits and command ID.
+ */
+ if (ItemIdIsNormal(iid))
+ {
+ HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+ page_htup->t_infomask =
+ HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+ HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+ page_htup->t_infomask |= HEAP_COMBOCID;
+ page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+ }
+
+ /*
+ * Ignore any padding bytes after the tuple, when the length of
+ * the item is not MAXALIGNed.
+ */
+ if (ItemIdHasStorage(iid))
+ {
+ int len = ItemIdGetLength(iid);
+ int padlen = MAXALIGN(len) - len;
+
+ if (padlen > 0)
+ memset(page_item + len, MASK_MARKER, padlen);
+ }
+ }
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(Page page)
+{
+ mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(Page page)
+{
+ mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(Page page, BlockNumber blkno)
+{
+ /* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+ if (blkno != 0)
+ mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(Page page)
+{
+ /*
+ * FIXME: currently, we just ignore sequence records altogether. nextval
+ * records a different value in the WAL record than it writes to the
+ * buffer. Ideally we would only mask out the value in the tuple.
+ */
+ memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(Page page)
+{
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BTPageOpaque maskopaq = (BTPageOpaque)
+ (((char *) page) + ((PageHeader) page)->pd_special);
+
+ /*
+ * Mark unused space before any processing. This is important as it
+ * uses pd_lower and pd_upper that may be masked on this page
+ * afterwards if it is a deleted page.
+ */
+ mask_unused_space(page);
+
+ /*
+ * Mask everything on a DELETED page.
+ */
+ if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+ {
+ /* Page content, between standard page header and opaque struct */
+ memset(page + SizeOfPageHeaderData, MASK_MARKER,
+ BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+ /* pd_lower and upper */
+ memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+ memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+ }
+ else
+ {
+ /*
+ * Mask some line pointer bits, particularly those marked as
+ * used on a master and unused on a standby.
+ * XXX: This could be refined.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = 1; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+ if (ItemIdIsUsed(iid))
+ iid->lp_flags = LP_UNUSED;
+ }
+ }
+
+ /*
+ * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+ * of process as previous masking operations could generate some
+ * garbage.
+ */
+ maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/*
+ * mask_page
+ *
+ * Mask a given page. First try to find what kind of page it is
+ * and then normalize it. This function returns a normalized page
+ * palloc'ed. So caller should free the normalized page correctly when
+ * using this function. Tracking blkno is needed for gin pages as their
+ * metapage does not use pd_lower and pd_upper.
+ * Before calling this function, it is assumed that caller has already
+ * taken a proper lock on the page being masked.
+ */
+char *
+mask_page(const char *page, BlockNumber blkno)
+{
+ Page page_norm;
+ uint16 tail;
+
+ page_norm = (Page) palloc(BLCKSZ);
+ memcpy(page_norm, page, BLCKSZ);
+
+ /*
+ * Look at the size of the special area, and the last two bytes in
+ * it, to detect what kind of a page it is. Then call the appropriate
+ * masking function.
+ */
+ memcpy(&tail, &page[BLCKSZ - 2], 2);
+ if (PageGetSpecialSize(page) == 0)
+ {
+ /* Case of a normal relation, it has an empty special area */
+ mask_heap_page(page_norm);
+ }
+ else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+ tail == GIST_PAGE_ID)
+ {
+ /* Gist page */
+ mask_gist_page(page_norm);
+ }
+ else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+ tail <= MAX_BT_CYCLE_ID)
+ {
+ /* btree page */
+ mask_btree_page(page_norm);
+ }
+ else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+ tail == SPGIST_PAGE_ID)
+ {
+ /* SpGist page */
+ mask_spgist_page(page_norm);
+ }
+ else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+ PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+ {
+ /*
+ * The page found here is used either for a Gin index or a sequence.
+ * Gin index pages do not have a proper identifier, so check if the page
+ * is used by a sequence or not. If it is not the case, this page is used
+ * by a gin index. It is still possible that a gin page's special area
+ * contains exactly the same value as SEQ_MAGIC, but this is unlikely to happen.
+ */
+ if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+ mask_sequence_page(page_norm);
+ else
+ mask_gin_page(page_norm, blkno);
+ }
+ else
+ {
+ /* Should not come here */
+ Assert(0);
+ }
+
+ /* Return normalized page */
+ return (char *) page_norm;
+}
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..2455878 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,19 @@
#include "nodes/parsenodes.h"
#include "storage/relfilenode.h"
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+ uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is for the convenience of being able to identify if a page
+ * is being used by a sequence.
+ */
+#define SEQ_MAGIC 0x1717
typedef struct FormData_pg_sequence
{
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..1dd5a67
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ * Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+/* Entry point for page masking */
+extern char *mask_page(const char *page, BlockNumber blkno);
+
+#endif
--
2.0.2
On 23 July 2014 15:14, Michael Paquier <michael.paquier@gmail.com> wrote:
I have spent some time digging more into this idea and finished with the
patch attached
Thank you for investigating the idea. I'll review by Monday.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 24, 2014 at 12:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 23 July 2014 15:14, Michael Paquier <michael.paquier@gmail.com> wrote:
I have spent some time digging more into this idea and finished with the
patch attached
Thank you for investigating the idea. I'll review by Monday.
OK, thanks. Here are a couple of things that are not really necessary
for the feature, but that I did to facilitate tests with the patch as well
as its review:
- Some information is logged to the user as DEBUG1 even if the current
page and FPW are consistent. It may be better to remove it.
- FPW/page consistency check is done after converting them to hex.
This is done only this way to facilitate viewing the page diffs with a
debugger. A better method would be to perform the checks using
MASK_MARKER (which should be moved to bufmask.h btw). It may be better
to put all this hex magic within a WAL_DEBUG ifdef.
Regards,
--
Michael
On 2014-07-24 20:35:04 +0900, Michael Paquier wrote:
- FPW/page consistency check is done after converting them to hex.
This is done only this way to facilitate viewing the page diffs with a
debugger. A better method would be to perform the checks using
MASK_MARKER (which should be moved to bufmask.h btw). It may be better
to put all this hex magic within a WAL_DEBUG ifdef.
Can't you just do "p/x whatever" in the debugger to display things in
hex?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 24, 2014 at 8:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Can't you just do "p/x whatever" in the debugger to display things in
hex?
Well yes :p
--
Michael
On 07/23/2014 05:14 PM, Michael Paquier wrote:
On Tue, Jul 22, 2014 at 4:49 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:Then, looking at the code, we would need to tweak XLogInsert for the
WAL record construction to always do a FPW and to update
XLogCheckBufferNeedsBackup. Then for the redo part, we would need to
do some extra operations in the area of
RestoreBackupBlock/RestoreBackupBlockContents, including masking
operations before comparing the content of the FPW and the current
page. Does that sound right?

I have spent some time digging more into this idea and finished with the
patch attached, doing the following: addition of a consistency check when
a FPW is restored and applied on a given page.

The consistency check is made of two phases:
- Apply a mask on the FPW and the current page to eliminate potential
conflicts like hint bits for example.
- Check that the FPW is consistent with the current page, i.e. that the
current page does not contain any new information that the FPW does
not. This is done by comparing the masked portions of the FPW and the
current page.
Also some more details:
- If an inconsistency is found, a WARNING is simply logged.
- The consistency check is done if current page is not empty, and if
database has reached a consistent state.
- The page masking API is taken from the WAL replay patch that was
submitted in CF1 and plugged in as an independent set of API.
- In the masking code, to detect whether a page is used by a sequence
relation, SEQ_MAGIC as well as its opaque data structure are renamed
and moved into sequence.h.
- To facilitate debugging and comparison, the masked FPW and current page
are also converted into hex.
Things could be refactored and improved for sure, but this patch is already
useful as-is so I am going to add it to the next commit fest.
I don't understand how this works. A full-page image contains the new
page contents *after* the WAL-logged operation. For example, in a heap
insert, the full-page image contains the new tuple. How can you compare
that with what's on the disk already?
ISTM you'd need to log two full-page images for every WAL record. A
before image and an after image. Then you could do a lot of checking:
1. the before image should match what's on disk already
2. the result after applying the WAL record should match the after image.
That would be more handy than the approach I used, where the page images
are logged to a separate file. You wouldn't need to deal with any new
files, as all the data is in the WAL. Verification would be done
directly in the standby, with no need to run any extra programs.
- Heikki
On Tue, Jul 29, 2014 at 7:30 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I don't understand how this works. A full-page image contains the new page
contents *after* the WAL-logged operation. For example, in a heap insert,
the full-page image contains the new tuple. How can you compare that with
what's on the disk already?
An exact match of the FPW and the current page is not done; the patch
as it stands now checks that a FPW is consistent with the content of
the current page, i.e. that the page does not include changes that
diverge from what the FPW has.
For example, for a heap insert, if the current page has N records
pointer1/tup1..pointerN/tupN, the FPW should contain exactly (N+1)
records pointer1/tup1..pointer(N+1)/tup(N+1). After applying the mask
at block recovery, the process simply checks that the FPW and the
current page contain the same first N records, marking them as
inconsistent if the current page has some garbage, like extra tuple
entries not in the FPW. I am sure you have arguments against that
though...
ISTM you'd need to log two full-page images for every WAL record. A before
image and an after image.
The after image is the current FPW, so there is nothing else to do for
it. But for the before buffer, what do you think about using
ReadBufferExtended with RBM_NORMAL? We could grab its content from
disk in XLogInsert only when we are sure that a backup block is added.
Then you could do a lot of checking:
1. the before image should match what's on disk already
2. the result after applying the WAL record should match the after image.
A WAL record can contain up to XLR_MAX_BKP_BLOCKS backup blocks.
Should we double it from 4 to 8?
That would be more handy than the approach I used, where the page images are
logged to a separate file. You wouldn't need to deal with any new files, as
all the data is in the WAL. Verification would be done directly in the
standby, with no need to run any extra programs.
In this case, would it be better to control that with a GUC? Making that
the default will increase the amount of WAL for all types of
applications, except if coupled with FPW compression...
Regards,
--
Michael
On 29 July 2014 11:30, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I don't understand how this works. A full-page image contains the new page
contents *after* the WAL-logged operation. For example, in a heap insert,
the full-page image contains the new tuple. How can you compare that with
what's on the disk already?

ISTM you'd need to log two full-page images for every WAL record. A before
image and an after image. Then you could do a lot of checking:
1. the before image should match what's on disk already
2. the result after applying the WAL record should match the after image.

That would be more handy than the approach I used, where the page images are
logged to a separate file. You wouldn't need to deal with any new files, as
all the data is in the WAL. Verification would be done directly in the
standby, with no need to run any extra programs.
It doesn't matter whether we take a before or after image of the page.
What is important is that we make the check on the standby at the same
point as the full page was taken on the master. After all, the pages
are marked as removable.
Given the pages are after images, then we just make the check after
applying WAL.
So I don't see the need for two full page images.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 31, 2014 at 2:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 29 July 2014 11:30, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I don't understand how this works. A full-page image contains the new page
contents *after* the WAL-logged operation. For example, in a heap insert,
the full-page image contains the new tuple. How can you compare that with
what's on the disk already?

ISTM you'd need to log two full-page images for every WAL record. A before
image and an after image. Then you could do a lot of checking:
1. the before image should match what's on disk already
2. the result after applying the WAL record should match the after image.

That would be more handy than the approach I used, where the page images are
logged to a separate file. You wouldn't need to deal with any new files, as
all the data is in the WAL. Verification would be done directly in the
standby, with no need to run any extra programs.

It doesn't matter whether we take a before or after image of the page.
What is important is that we make the check on the standby at the same
point as the full page was taken on the master. After all, the pages
are marked as removable.

Given the pages are after images, then we just make the check after
applying WAL.

So I don't see the need for two full page images.
By doing so you definitely need an additional mode for full-page
writes: one certifying that the process does not apply this FPW
because it wants to compare it to the current page after replaying
WAL. This increases the footprint of the feature on the code because
all the code paths where RestoreBackupBlock is called need to be
bypassed.
--
Michael
On 31 July 2014 07:45, Michael Paquier <michael.paquier@gmail.com> wrote:
So I don't see the need for two full page images.
By doing so you definitely need an additional mode for full-page
writes: one certifying that the process does not apply this FPW
because it wants to compare it to the current page after replaying
WAL. This increases the footprint of the feature on the code because
all the code paths where RestoreBackupBlock is called need to be
bypassed.
Yeah, it looks like you need to do CheckBackupBlock() exactly as many
times as you do RestoreBackupBlock(), with the sequence of actions
being RestoreBackupBlock(), apply WAL, then CheckBackupBlock(). That
will work without much code churn; it will just be a one-line addition
in a few dozen places.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 31, 2014 at 4:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yeah, it looks like you need to do CheckBackupBlock() exactly as many
times as you do RestoreBackupBlock(), with the sequence of actions
being RestoreBackupBlock(), apply WAL, then CheckBackupBlock(). That
will work without much code churn; it will just be a one-line addition
in a few dozen places.
Additionally, as this is a recovery-only feature, I was thinking that
it would be better to control it with a parameter of recovery.conf.
Let's call it check_full_page_writes for example. Thoughts?
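For example, in recovery.conf (check_full_page_writes being only the name proposed here, not an existing parameter):

```
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432'
# Hypothetical: verify each restored FPW against the replayed page
check_full_page_writes = 'on'
```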
--
Michael
On Wed, Jul 23, 2014 at 11:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Things could be refactored and improved for sure, but this patch is already
useful as-is so I am going to add it to the next commit fest.
After some more investigation, I am going to mark this patch as
"Returned with feedback" for the time being (mainly to let it show up
on the commit fest app and for the sake of the archives), for two
main reasons:
- We can do better than what I sent: instead of checking if the FPW
and the current page are somewhat consistent, we could actually check
if the current page is equal to the FPW after applying WAL on it. In
order to do that, we would need to bypass the FPW replay and to apply
WAL on the current page (if the page is already initialized), then
control RestoreBackupBlock (or its equivalent) with an additional
flag telling that the block is "not restored, but can get WAL applied
to it safely". Then a comparison with the FPW contained in the WAL
record can be made.
- Heikki's patch to change the WAL APIs and track block changes more
easily is going to make this far simpler to implement. It also
improves the status checks on which blocks have been restored, so it
is more easily extensible for what could be done here.
Regards,
--
Michael