WAL consistency check facility

Started by Kuntal Ghosh over 9 years ago, 124 messages
#1 Kuntal Ghosh
kuntalghosh.2007@gmail.com
1 attachment(s)

Hi,

I've attached a patch that checks whether the current page matches the
full-page write (FPW) image after WAL has been applied to it. This is
how the patch works:

1. When a WAL record is inserted, an FPW is always included for that
operation. A flag records whether the page image actually needs to be
restored during replay.
2. During recovery, after a redo operation is done, we compare the FPW
contained in the WAL record with the current page in the buffer. For
this purpose, I've used Michael's patch, with minor changes, to check
whether the two pages are actually equal.
3. I've also added a GUC variable (wal_consistency_mask) to indicate
the operations (HEAP, BTREE, HASH, GIN, etc.) for which this feature
(always FPW plus consistency check) is to be enabled.
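
The comparison in step 2 can be sketched roughly as follows. This is a standalone illustration, not code from the patch: PAGE_SIZE, mask_range and first_mismatch are hypothetical names, and the patch itself masks page-type-specific fields (LSN, hint bits, unused space) through mask_page before comparing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE   8192    /* assumption: stands in for BLCKSZ */
#define MASK_MARKER 0xFF    /* same marker value the patch uses */

/* Overwrite bytes [start, end) so that they can never cause a mismatch. */
static void
mask_range(unsigned char *page, size_t start, size_t end)
{
    assert(start <= end && end <= PAGE_SIZE);
    memset(page + start, MASK_MARKER, end - start);
}

/*
 * Return the offset of the first differing byte between two already
 * masked pages, or -1 if the pages are identical.
 */
static long
first_mismatch(const unsigned char *a, const unsigned char *b)
{
    size_t      i;

    for (i = 0; i < PAGE_SIZE; i++)
    {
        if (a[i] != b[i])
            return (long) i;
    }
    return -1;
}
```

A caller would mask the same ranges on both the FPW image and the buffer page, then report a warning if first_mismatch returns a non-negative offset.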

How to use the patch:
1. Apply the patch.
2. In postgresql.conf, set the wal_consistency_mask variable
accordingly. For debug messages, set log_min_messages = debug1.
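
As a rough aid for choosing a mask value, the bit layout can be sketched like this. The bit order follows the comment the patch adds to postgresql.conf.sample (bit 0 = HEAP2 through bit 8 = BRIN); the helper names consistency_bit and consistency_enabled are hypothetical and do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bit order as listed in the patch's postgresql.conf.sample comment. */
static const char *const rmgr_names[] = {
    "HEAP2", "HEAP", "BTREE", "HASH", "GIN", "GIST", "SEQ", "SPGIST", "BRIN"
};

#define N_RMGRS (sizeof(rmgr_names) / sizeof(rmgr_names[0]))

/* Return the mask bit for a resource manager name, or 0 if it is unknown. */
static int
consistency_bit(const char *name)
{
    size_t      i;

    assert(name != NULL);
    for (i = 0; i < N_RMGRS; i++)
    {
        if (strcmp(rmgr_names[i], name) == 0)
            return 1 << i;
    }
    return 0;
}

/* True if the given mask enables checking for the named resource manager. */
static bool
consistency_enabled(int mask, const char *name)
{
    return (mask & consistency_bit(name)) != 0;
}
```

For example, HEAP is bit 1 (value 2) and HASH is bit 3 (value 8), so enabling both means setting wal_consistency_mask = 10.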

Michael's patch:
/messages/by-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com

Reference thread:
/messages/by-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com

Please let me know your thoughts on this.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v3.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..9380079 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -52,7 +53,9 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/barrier.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
+#include "storage/bufpage.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/large_object.h"
@@ -94,6 +97,7 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+int		wal_consistency_mask = 0;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -867,6 +871,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+void checkWALConsistency(XLogReaderState *xlogreader);
+void checkWALConsistencyForBlock(XLogReaderState *record, uint8 block_id);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6868,6 +6875,12 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * Check whether the page associated with WAL record is consistent
+				 * with the existing page
+				 */
+				checkWALConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11626,3 +11639,160 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Check whether the page associated with WAL record is consistent with the
+ * existing page or not.
+ */
+void checkWALConsistency(XLogReaderState *xlogreader)
+{
+	RmgrIds rmid = (RmgrIds) XLogRecGetRmid(xlogreader);
+	int block_id;
+	int enableWALConsistencyMask = 1;
+	RmgrIds rmids[] = {RM_HEAP2_ID,RM_HEAP_ID,RM_BTREE_ID,RM_HASH_ID,RM_GIN_ID,RM_GIST_ID,RM_SEQ_ID,RM_SPGIST_ID,RM_BRIN_ID};
+	int size = sizeof(rmids)/sizeof(rmid);
+	int i;
+	for(i=0;i<size;i++)
+	{
+		if(rmids[i]==rmid && (wal_consistency_mask & enableWALConsistencyMask))
+		{
+			for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
+				checkWALConsistencyForBlock(xlogreader,block_id);
+			break;
+		}
+		/*
+		 * Enable checking for the next bit
+		 */
+		enableWALConsistencyMask <<= 1;
+	}
+}
+void checkWALConsistencyForBlock(XLogReaderState *record, uint8 block_id)
+{
+	Buffer buf;
+	char *ptr;
+	DecodedBkpBlock *bkpb;
+	char		tmp[BLCKSZ];
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	Page		page;
+
+	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	{
+		/* Caller specified a bogus block_id. Don't do anything. */
+		return;
+	}
+	buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									   RBM_WAL_CHECK);
+	page = BufferGetPage(buf);
+
+	bkpb = &record->blocks[block_id];
+	if(bkpb->bkp_image!=NULL)
+		ptr = bkpb->bkp_image;
+	else
+	{
+		elog(WARNING,
+				 "No page found in WAL for record %X/%X, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 (uint32) (record->ReadRecPtr>> 32), (uint32) record->ReadRecPtr ,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		return;
+	}
+
+	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
+	{
+		/* If a backup block image is compressed, decompress it */
+		if (pglz_decompress(ptr, bkpb->bimg_len, tmp,
+							BLCKSZ - bkpb->hole_length) < 0)
+		{
+			elog(ERROR, "invalid compressed image at %X/%X, block %d",
+								  (uint32) (record->ReadRecPtr >> 32),
+								  (uint32) record->ReadRecPtr,
+								  block_id);
+		}
+		ptr = tmp;
+	}
+	/*
+	 * If block restores the associated page during WAL replay,
+	 * adjust the block hole accordingly.
+	 */
+	if (bkpb->hole_length == 0)
+	{
+		memcpy(tmp, ptr, BLCKSZ);
+	}
+	else
+	{
+		memcpy(tmp, ptr, bkpb->hole_offset);
+		/* must zero-fill the hole */
+		MemSet(tmp + bkpb->hole_offset, 0, bkpb->hole_length);
+		memcpy(tmp + (bkpb->hole_offset + bkpb->hole_length),
+			ptr + bkpb->hole_offset,
+			BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+	}
+	ptr = tmp;
+	char *norm_new_page, *norm_old_page;
+	char	old_buf[BLCKSZ * 2];
+	char	new_buf[BLCKSZ * 2];
+	int		j = 0;
+	int		i;
+	bool	inconsistent = false;
+
+	/* Mask pages */
+	norm_new_page = mask_page((Page)ptr, blkno);
+	norm_old_page = mask_page((Page)page, blkno);
+	/*
+	 * Convert the pages to be compared into hex format to facilitate
+	 * their comparison and make potential diffs more readable while
+	 * debugging.
+	 */
+	for (i = 0; i < BLCKSZ; i++)
+	{
+		const char *digits = "0123456789ABCDEF";
+		uint8 byte_new = (uint8) norm_new_page[i];
+		uint8 byte_old = (uint8) norm_old_page[i];
+
+		new_buf[j] = digits[byte_new >> 4];
+		old_buf[j] = digits[byte_old >> 4];
+		/*
+		 * Do an inclusive comparison, if the new buffer has a mask
+		 * marker and not the old buffer pages are inconsistent as this
+		 * would mean that the old page has content that the new buffer
+		 * has not.
+		 */
+		if (new_buf[j]!=old_buf[j])
+		{
+			inconsistent = true;
+			break;
+		}
+		j++;
+		new_buf[j] = digits[byte_new & 0x0F];
+		old_buf[j] = digits[byte_old & 0x0F];
+		if (new_buf[j]!=old_buf[j])
+		{
+			inconsistent = true;
+			break;
+		}
+		j++;
+	}
+
+	/* Time to compare the old and new contents */
+	if (inconsistent)
+		elog(WARNING,
+			 "Inconsistent page (at byte %u) found for record %X/%X, rel %u/%u/%u, "
+			 "forknum %u, blkno %u", i,
+			 (uint32) (record->ReadRecPtr>> 32), (uint32) record->ReadRecPtr ,
+			 rnode.spcNode, rnode.dbNode, rnode.relNode,
+			 forknum, blkno);
+	else
+		elog(DEBUG1,
+			 "Consistent page found for record %X/%X, rel %u/%u/%u, "
+			 "forknum %u, blkno %u",
+			 (uint32) (record->ReadRecPtr  >> 32), (uint32) record->ReadRecPtr ,
+			 rnode.spcNode, rnode.dbNode, rnode.relNode,
+			 forknum, blkno);
+
+	pfree(norm_new_page);
+	pfree(norm_old_page);
+	ReleaseBuffer(buf);
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c37003a..5ff41a3 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -513,7 +513,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
-
+		int		enableWALConsistencyMask = 1;
+		RmgrIds		rmids[] = {RM_HEAP2_ID,RM_HEAP_ID,RM_BTREE_ID,RM_HASH_ID,RM_GIN_ID,RM_GIST_ID,RM_SEQ_ID,RM_SPGIST_ID,RM_BRIN_ID};
+		int		size = sizeof(rmids)/sizeof(rmid);
+		int		i;
+		bool		needs_image_backup; /*Since, we always set needs_backup to true,
+							this field remembers the original value of needs_backup*/
 		if (!regbuf->in_use)
 			continue;
 
@@ -556,6 +561,24 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
+		/*
+		 * If wal consistency check is enabled for current rmid,
+		 * We do fpw for the current block.
+		 */
+		needs_image_backup = needs_backup;
+		for(i=0;i<size;i++)
+		{
+			if(rmids[i]==rmid && (wal_consistency_mask & enableWALConsistencyMask))
+			{
+				needs_backup = true;
+				break;
+			}
+			/*
+			 * Enable checking for the next bit
+			 */
+			enableWALConsistencyMask <<= 1;
+		}
+
 		if (needs_backup)
 		{
 			Page		page = regbuf->page;
@@ -618,6 +641,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			if (needs_image_backup)
+				bimg.bimg_info |= BKPIMAGE_IS_REQUIRED;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index dcf747c..5e53df3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1077,11 +1077,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			}
 			datatotal += blk->data_len;
 
+			blk->require_image=false;
 			if (blk->has_image)
 			{
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+				/*
+				 * If we require the image for any purpose other than the WAL consistency
+				 * check, set the require_image flag.
+				 */
+				if(blk->bimg_info & BKPIMAGE_IS_REQUIRED)
+					blk->require_image = true;
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1222,6 +1229,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
 		}
+		/*
+		 * If image is inserted in the WAL record for any other purpose than WAL
+		 * consistency check, set has_image=true, else set it to false.
+		 */
+		blk->has_image=blk->require_image;
 	}
 
 	/* and finally, the main data */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index c98f981..eaf2d8b 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -49,16 +49,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -329,7 +319,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -339,9 +329,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1109,18 +1099,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1585,7 +1575,7 @@ seq_redo(XLogReaderState *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	if (info != XLOG_SEQ_LOG)
 		elog(PANIC, "seq_redo: unknown op code %u", info);
@@ -1604,9 +1594,9 @@ seq_redo(XLogReaderState *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec);
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..99c0e15
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,372 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/hash.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufmask.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+static void mask_unused_space(Page page);
+static void mask_page_lsn(Page page);
+static void mask_heap_page(Page page);
+static void mask_spgist_page(Page page);
+static void mask_gist_page(Page page);
+static void mask_gin_page(Page page, BlockNumber blkno);
+static void mask_sequence_page(Page page);
+static void mask_btree_page(Page page);
+static void mask_hash_page(Page page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int	pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X\n",
+			 ((PageHeader) page)->pd_lsn.xlogid,
+			 ((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+/*
+ * Mask Page LSN
+ */
+static void
+mask_page_lsn(Page page)
+{
+
+	PageHeader phdr = (PageHeader) page;
+	PageXLogRecPtrSet(phdr->pd_lsn,0);
+}
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(Page page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(Page page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(Page page)
+{
+	mask_unused_space(page);
+
+	/*Mask NSN*/
+	GistPageSetNSN(page, 0);
+	/* Mask flag bits of a gist page*/
+	GistPageSetDeleted(page);
+	GistMarkTuplesDeleted(page);
+	GistMarkPageHasGarbage(page);
+	GistMarkFollowRight(page);
+}
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(Page page, BlockNumber blkno)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(Page page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(Page page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+			(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+static void
+mask_hash_page(Page page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+	/*
+	 * Mask everything on a UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE)==0)
+	{
+		/*
+		 * For pages other than metapage,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+}
+/*
+ * mask_page
+ *
+ * Mask a given page. First try to find what kind of page it is
+ * and then normalize it. This function returns a normalized page
+ * palloc'ed. So caller should free the normalized page correctly when
+ * using this function. Tracking blkno is needed for gin pages as their
+ * metapage does not use pd_lower and pd_upper.
+ * Before calling this function, it is assumed that caller has already
+ * taken a proper lock on the page being masked.
+ */
+char *
+mask_page(const char *page, BlockNumber blkno)
+{
+	Page	page_norm;
+	uint16	tail;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		/* Case of a normal relation, it has an empty special area */
+		mask_heap_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+			 tail == GIST_PAGE_ID)
+	{
+		/* Gist page */
+		mask_gist_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+			 tail <= MAX_BT_CYCLE_ID)
+	{
+		/* btree page */
+		mask_btree_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+			 tail == SPGIST_PAGE_ID)
+	{
+		/* SpGist page */
+		mask_spgist_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+	{
+		/*
+		 * The page found here is used either for a Gin index or a sequence.
+		 * Gin index pages do not have a proper identifier, so check if the page
+		 * is used by a sequence or not. If it is not the case, this page is used
+		 * by a gin index. It is still possible that a gin page fills this area
+		 * with exactly the same value as SEQ_MAGIC, but this is unlikely to happen.
+		 */
+		if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+			mask_sequence_page(page_norm);
+		else
+			mask_gin_page(page_norm, blkno);
+	}
+	else if(PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)))
+	{
+		mask_hash_page(page_norm);
+	}
+	else
+	{
+		/* Should not come here except BRIN pages*/
+		Assert(0);
+	}
+
+	/* Return normalized page */
+	return (char *) page_norm;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6ac5184..645a807 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1800,6 +1800,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"wal_consistency_mask", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Mask to enable WAL consistency for HEAP_INSERT/HEAP_INSERT2."),
+			NULL
+		},
+		&wal_consistency_mask,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d0666c..e7e21ed 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,17 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency_mask = 0		# enable WAL consistency check for different operations
+					# bit 0 - HEAP2
+					# bit 1 - HEAP
+					# bit 2 - BTREE
+					# bit 3 - HASH
+					# bit 4 - GIN
+					# bit 5 - GIST
+					# bit 6 - SEQ
+					# bit 7 - SPGIST
+					# bit 8 - BRIN
+					# Multiple bits can also be enabled. For example, to enable HEAP and HASH, set the value to 10
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 14b7f7f..1fc5f6e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -104,6 +104,7 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern int wal_consistency_mask;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..287143b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,8 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		require_image; /* This field contains the true value of has_image.
+					Because, if wal consistency check is enabled, has_image will always be true.*/
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..34e28c0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,7 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
-
+#define BKPIMAGE_IS_REQUIRED		0x04	/* page is required by the WAL record */
 /*
  * Extra header information used when page image has "hole" and
  * is compressed.
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 6af60d8..a7a0e16 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -20,6 +20,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is for convenience, to be able to identify if a page
+ * is being used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..1dd5a67
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+/* Entry point for page masking */
+extern char *mask_page(const char *page, BlockNumber blkno);
+
+#endif
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6621a39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -44,8 +44,9 @@ typedef enum
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
-	RBM_NORMAL_NO_LOG			/* Don't log page as invalid during WAL
+	RBM_NORMAL_NO_LOG,			/* Don't log page as invalid during WAL
 								 * replay; otherwise same as RBM_NORMAL */
+	RBM_WAL_CHECK		/*Normal read, but don't check whether the page is new or not. */
 } ReadBufferMode;
 
 /* forward declared, to avoid having to expose buf_internals.h here */
#2 Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#1)
Re: WAL consistency check facility

On Mon, Aug 22, 2016 at 9:44 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

> Please let me know your thoughts on this.

Since custom AMs have been introduced, I have kept that in a corner of
my mind and thought about it a bit. And while the goal of this patch
is clearly worth it, I don't think that the page masking interface is
clear at all. For example, your patch completely ignores
contrib/bloom, and we surely want to do something about it. The idea
would be to add a page masking routine in IndexAmRoutine and heap to
be able to perform page masking operations directly with that. This
would allow as well one to be able to perform page masking for bloom
or any custom access method, and this will allow this sanity check to
be more generic as well.

Another pin-point is: given a certain page, how do we identify of
which type it is? One possibility would be again to extend the AM
handler with some kind of is_self function with a prototype like that:
bool handler->is_self(Page);
If the page is of the type of the handler, this returns true, and
false otherwise. Still here performance would suck.
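
A minimal sketch of what such an is_self probe could look like, under stated assumptions: PageHandler, the toy page IDs 0xA001/0xA002 and every function name here are hypothetical, and identifying a page by its last two bytes is only a stand-in for whatever test a real handler would use. The linear scan over handlers is where the performance concern comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 8192          /* assumption: stands in for BLCKSZ */

typedef struct PageHandler
{
    const char *name;
    bool        (*is_self) (const unsigned char *page);    /* page owned by this AM? */
    void        (*mask) (unsigned char *page);              /* AM-specific masking */
} PageHandler;

/* Toy identification: a 16-bit page ID stored in the last two bytes. */
static unsigned
page_tail_id(const unsigned char *page)
{
    return page[PAGE_SIZE - 2] | ((unsigned) page[PAGE_SIZE - 1] << 8);
}

static bool
is_toy_gist(const unsigned char *page)
{
    return page_tail_id(page) == 0xA001;    /* made-up ID */
}

static bool
is_toy_spgist(const unsigned char *page)
{
    return page_tail_id(page) == 0xA002;    /* made-up ID */
}

/* Toy masking: hide the first 8 bytes, where a page LSN would live. */
static void
mask_toy(unsigned char *page)
{
    memset(page, 0xFF, 8);
}

static const PageHandler handlers[] = {
    {"toy_gist", is_toy_gist, mask_toy},
    {"toy_spgist", is_toy_spgist, mask_toy},
};

/* Probe each handler in turn; returns NULL if no handler claims the page. */
static const PageHandler *
find_handler(const unsigned char *page)
{
    size_t      i;

    assert(page != NULL);
    for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
    {
        if (handlers[i].is_self(page))
            return &handlers[i];
    }
    return NULL;
}
```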

At the end, what we want is a clean interface, and more thoughts into it.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#2)
Re: WAL consistency check facility

On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> Another pin-point is: given a certain page, how do we identify of
> which type it is? One possibility would be again to extend the AM
> handler with some kind of is_self function with a prototype like that:
> bool handler->is_self(Page);
> If the page is of the type of the handler, this returns true, and
> false otherwise. Still here performance would suck.
>
> At the end, what we want is a clean interface, and more thoughts into it.

I think that it makes sense to filter based on the resource manager
ID, but I can't see how we could reasonably filter based on the AM
name.
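A rough standalone sketch of filtering by resource manager ID, under the assumption that each resource manager could register its own masking routine. ToyRmgrId, toy_mask_table and toy_mask_for_rmgr are hypothetical names, and the masking bodies are placeholders. The point is only that the WAL record already carries the ID, so choosing the masking routine becomes an array lookup instead of inspecting the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ 32			/* toy page size for the sketch */

/* Hypothetical, shortened list of resource manager IDs. */
typedef enum ToyRmgrId
{
	TOY_RM_XLOG,
	TOY_RM_HEAP,
	TOY_RM_BTREE,
	TOY_RM_MAX
} ToyRmgrId;

typedef void (*toy_mask_fn) (uint8_t *page);

/* Placeholder masking routines: each just overwrites one byte. */
static void
toy_mask_heap(uint8_t *page)
{
	page[0] = 0xFF;
}

static void
toy_mask_btree(uint8_t *page)
{
	page[1] = 0xFF;
}

/* One slot per resource manager; NULL means "nothing to compare". */
static const toy_mask_fn toy_mask_table[TOY_RM_MAX] = {
	[TOY_RM_HEAP] = toy_mask_heap,
	[TOY_RM_BTREE] = toy_mask_btree,
};

/* Returns 1 if the page was masked, 0 if the rmgr has no masking routine. */
static int
toy_mask_for_rmgr(ToyRmgrId rmid, uint8_t *page)
{
	assert(rmid < TOY_RM_MAX);
	assert(page != NULL);

	if (toy_mask_table[rmid] == NULL)
		return 0;
	toy_mask_table[rmid] (page);
	return 1;
}
```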

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Simon Riggs
simon@2ndquadrant.com
In reply to: Kuntal Ghosh (#1)
Re: WAL consistency check facility

On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Please let me know your thoughts on this.

Do the regression tests pass with this option enabled?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)
Re: WAL consistency check facility

On Mon, Aug 22, 2016 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Another point is: given a certain page, how do we identify
which type it is? One possibility would be again to extend the AM
handler with some kind of is_self function with a prototype like this:
bool handler->is_self(Page);
If the page is of the type of the handler, this returns true, and
false otherwise. Still here performance would suck.

At the end, what we want is a clean interface, and more thoughts into it.

I think that it makes sense to filter based on the resource manager
ID

+1.

I think the patch currently addresses only a subset of resource
manager id's (mainly Heap and Index resource manager id's). Do we
want to handle the complete resource manager list as defined in
rmgrlist.h?

Another thing that needs some thoughts is the UI of this patch,
currently it is using integer mask which might not be best way, but
again as it is intended to be mainly used for tests, it might be okay.

Do we want to enable some tests in the regression suite by using this option?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#6Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Simon Riggs (#4)
Re: WAL consistency check facility

Yes, I've verified the outputs and log contents after running gmake
installcheck and gmake installcheck-world. The status of the test was
marked as pass for all the testcases.

On Mon, Aug 22, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Please let me know your thoughts on this.

Do the regression tests pass with this option enabled?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Thanks & Regards,
Kuntal Ghosh

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#5)
Re: WAL consistency check facility

On Tue, Aug 23, 2016 at 1:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 22, 2016 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Another point is: given a certain page, how do we identify
which type it is? One possibility would be again to extend the AM
handler with some kind of is_self function with a prototype like this:
bool handler->is_self(Page);
If the page is of the type of the handler, this returns true, and
false otherwise. Still here performance would suck.

At the end, what we want is a clean interface, and more thoughts into it.

I think that it makes sense to filter based on the resource manager
ID

+1.

Yes, actually that's better. That's simple enough and removes any need
to look at pd_special.

I think the patch currently addresses only a subset of resource
manager id's (mainly Heap and Index resource manager id's). Do we
want to handle the complete resource manager list as defined in
rmgrlist.h?

Not all of them generate FPWs. I don't think it matters much.

Another thing that needs some thoughts is the UI of this patch,
currently it is using integer mask which might not be best way, but
again as it is intended to be mainly used for tests, it might be okay.

What we'd want to have is a way to visualize easily differences of
pages. Any other ideas than MASK_MARKER would be welcome of course.
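As one possible starting point for making page differences easier to read, the sketch below locates the first differing byte between two already-masked buffers and renders a byte as two hex digits, in the same spirit as the hex conversion in the patch. It is standalone C; toy_first_diff and toy_hex_byte are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first differing byte, or -1 if identical. */
static long
toy_first_diff(const uint8_t *a, const uint8_t *b, size_t len)
{
	size_t		i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < len; i++)
	{
		if (a[i] != b[i])
			return (long) i;
	}
	return -1;
}

/* Render one byte as two uppercase hex digits plus a terminating NUL. */
static void
toy_hex_byte(uint8_t byte, char out[3])
{
	static const char digits[] = "0123456789ABCDEF";

	out[0] = digits[byte >> 4];
	out[1] = digits[byte & 0x0F];
	out[2] = '\0';
}
```

Reporting the offset together with the two hex values at that offset would already tell a developer which page field to look at, without dumping both pages in full.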

Do we want to enable some tests in the regression suite by using this option?

We could get out a recovery test that sets up a standby/master and
runs the tests of src/test/regress with pg_regress with this parameter
enabled.

+ * bufmask.c
+ *      Routines for buffer masking, used to ensure that buffers used for
+ *      comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
Copyright notices need to be updated. (It's already been 2 years!!)

Also, what's the use case of allowing only a certain set of rmgrs to
be checked? Wouldn't a simple on/off switch be simpler? As presented,
wal_consistency_mask is also going to be quite confusing for
users. You should not need to apply some maths to set up this
parameter. A list of rmgr names may be better suited if this level of
tuning is needed, but it seems to me that we don't need this much.
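For illustration only, a list-of-names setting could be turned into the same internal bitmask along these lines. This is a standalone sketch: toy_rmgr_names, its spellings and toy_parse_rmgr_list are assumptions, not part of the patch, and the parser is deliberately strict (no whitespace trimming, exact lower-case names).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, in the same order as the rmids[] array of the patch. */
static const char *const toy_rmgr_names[] = {
	"heap2", "heap", "btree", "hash", "gin",
	"gist", "sequence", "spgist", "brin"
};

#define TOY_NUM_RMGRS (sizeof(toy_rmgr_names) / sizeof(toy_rmgr_names[0]))

/*
 * Parse a comma-separated list of names into a bitmask.
 * Returns 0 on success and stores the mask in *mask; returns -1 on an
 * unknown or empty name, leaving *mask untouched.
 */
static int
toy_parse_rmgr_list(const char *list, unsigned *mask)
{
	const char *p = list;
	unsigned	result = 0;

	assert(list != NULL && mask != NULL);
	while (*p != '\0')
	{
		size_t		len = strcspn(p, ",");	/* length of current token */
		size_t		i;
		int			found = 0;

		for (i = 0; i < TOY_NUM_RMGRS; i++)
		{
			if (strlen(toy_rmgr_names[i]) == len &&
				strncmp(p, toy_rmgr_names[i], len) == 0)
			{
				result |= 1u << i;
				found = 1;
				break;
			}
		}
		if (!found)
			return -1;
		p += len;
		if (*p == ',')
			p++;				/* skip the separator */
	}
	*mask = result;
	return 0;
}
```

With this shape, a user would write something like "heap,btree" instead of computing 6 by hand.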
--
Michael

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#7)
Re: WAL consistency check facility

On Tue, Aug 23, 2016 at 10:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Also, what's the use case of allowing only a certain set of rmgrs to
be checked? Wouldn't a simple on/off switch be simpler?

I think there should be a way to test WAL for one particular resource
manager. For example, if someone develops a new index or some other
heap storage, only that particular module can be verified. Generating
WAL for all the resource managers together can also serve the purpose,
but it will be slightly difficult to verify it.

As presented,
wal_consistency_mask is also going to be quite confusing for
users. You should not need to apply some maths to set up this
parameter. A list of rmgr names may be better suited if this level of
tuning is needed,

Yeah, that can be better.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#4)
Re: WAL consistency check facility

On 22 August 2016 at 16:56, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Please let me know your thoughts on this.

Do the regression tests pass with this option enabled?

Hi,

I'd like to be a reviewer on this. Please can you add this onto the CF
app so we can track the review?

Please supply details of the testing and test coverage.

Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Simon Riggs (#9)
1 attachment(s)
Re: WAL consistency check facility

Hi,

I've added the feature in the CF app. Following are the testing details:

1. In master, I've enabled following configurations:

* wal_level = replica
* max_wal_senders = 3
* wal_keep_segments = 4000
* hot_standby = on
* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

2. In slave, I've enabled following configurations:

* standby_mode = on
* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

3. Then, I performed gmake installcheck in master. I didn't get any
warning regarding WAL inconsistency in slave.
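To spell out what the value 511 means: judging from the rmids[] array in the patch, bit i of wal_consistency_mask appears to correspond to the i-th entry of {HEAP2, HEAP, BTREE, HASH, GIN, GIST, SEQ, SPGIST, BRIN}, so 511 (nine low bits set) would enable all nine. The standalone sketch below encodes that reading. ToyRmgr and the helper names are hypothetical.

```c
#include <assert.h>

/* Order copied from the rmids[] array in the patch; names are toy ones. */
typedef enum ToyRmgr
{
	TOY_HEAP2,
	TOY_HEAP,
	TOY_BTREE,
	TOY_HASH,
	TOY_GIN,
	TOY_GIST,
	TOY_SEQ,
	TOY_SPGIST,
	TOY_BRIN,
	TOY_NUM
} ToyRmgr;

/* Does the given mask enable consistency checking for resource manager r? */
static int
toy_mask_enables(unsigned mask, ToyRmgr r)
{
	assert(r < TOY_NUM);
	return (int) ((mask >> r) & 1u);
}

/* Mask with every supported resource manager enabled. */
static unsigned
toy_mask_all(void)
{
	return (1u << TOY_NUM) - 1u;
}
```

Under the same reading, a value of 4 would check btree records only.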

I've made following changes to the attached patch:

1. For BRIN pages, I've masked the unused space, PD_PAGE_FULL and
PD_HAS_FREE_LINES flags.
2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.
3. For GIN_DELETED page, I've masked the entire page since the page is
always initialized during recovery.
4. For Speculative Heap tuple insert operation, there was
inconsistency in t_ctid value. So, I've modified the t_ctid value (in
backup page) to current block number and offset number. Need
suggestions!!
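The flag masking in items 1 and 2 follows one pattern: force the hint bits to a known value on both copies before comparing, so that only the significant bits can cause a mismatch. A minimal standalone sketch of that pattern, with made-up flag values and hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Toy flag values for this sketch only. */
#define TOY_FLAG_HAS_FREE_LINES	0x0001	/* hint */
#define TOY_FLAG_PAGE_FULL		0x0002	/* hint */
#define TOY_FLAG_SIGNIFICANT	0x0004	/* must match after replay */

#define TOY_HINT_FLAGS (TOY_FLAG_HAS_FREE_LINES | TOY_FLAG_PAGE_FULL)

/* Force every hint bit to 1 so it can never cause a difference. */
static uint16_t
toy_mask_flags(uint16_t flags)
{
	/* A significant bit must never be part of the hint set. */
	assert((TOY_HINT_FLAGS & TOY_FLAG_SIGNIFICANT) == 0);
	return (uint16_t) (flags | TOY_HINT_FLAGS);
}

/* Compare two flag words while ignoring the hint bits. */
static int
toy_flags_consistent(uint16_t a, uint16_t b)
{
	return toy_mask_flags(a) == toy_mask_flags(b);
}
```

The trade-off is that any bit added to the hint set is a bit whose replay bugs can no longer be detected, which is why each masked flag deserves a comment explaining why it may legitimately differ.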

What needs to be done:
1. Add support for other Resource Managers.
2. Modify masking techniques for existing Resource Managers (if required).
3. Modify the GUC parameter which will accept a list of rmgr names.
4. Modify the technique for identifying rmgr names for which the
feature should be enabled.
5. Generalize the page type identification technique.

On Wed, Aug 24, 2016 at 2:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 August 2016 at 16:56, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Please let me know your thoughts on this.

Do the regression tests pass with this option enabled?

Hi,

I'd like to be a reviewer on this. Please can you add this onto the CF
app so we can track the review?

Please supply details of the testing and test coverage.

Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Thanks & Regards,
Kuntal Ghosh

Attachments:

walconsistency_v4.patchtext/x-patch; charset=US-ASCII; name=walconsistency_v4.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..7b64167 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -52,7 +53,9 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/barrier.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
+#include "storage/bufpage.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/large_object.h"
@@ -94,6 +97,7 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+int		wal_consistency_mask = 0;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -867,6 +871,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+void checkWALConsistency(XLogReaderState *xlogreader);
+void checkWALConsistencyForBlock(XLogReaderState *record, uint8 block_id);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6868,6 +6875,12 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * Check whether the page associated with WAL record is consistent
+				 * with the existing page
+				 */
+				checkWALConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11626,3 +11639,161 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Check whether the page associated with WAL record is consistent with the
+ * existing page or not.
+ */
+void checkWALConsistency(XLogReaderState *xlogreader)
+{
+	RmgrIds rmid = (RmgrIds) XLogRecGetRmid(xlogreader);
+	int block_id;
+	int enableWALConsistencyMask = 1;
+	RmgrIds rmids[] = {RM_HEAP2_ID,RM_HEAP_ID,RM_BTREE_ID,RM_HASH_ID,RM_GIN_ID,RM_GIST_ID,RM_SEQ_ID,RM_SPGIST_ID,RM_BRIN_ID};
+	int size = sizeof(rmids)/sizeof(rmid);
+	int i;
+
+	for (i = 0; i < size; i++)
+	{
+		if (rmids[i]==rmid && (wal_consistency_mask & enableWALConsistencyMask))
+		{
+			for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
+				checkWALConsistencyForBlock(xlogreader,block_id);
+			break;
+		}
+		/*
+		 * Enable checking for the next bit
+		 */
+		enableWALConsistencyMask <<= 1;
+	}
+}
+void checkWALConsistencyForBlock(XLogReaderState *record, uint8 block_id)
+{
+	Buffer buf;
+	char *ptr;
+	DecodedBkpBlock *bkpb;
+	char		tmp[BLCKSZ];
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	Page		page;
+
+	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	{
+		/* Caller specified a bogus block_id. Don't do anything. */
+		return;
+	}
+	buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									   RBM_WAL_CHECK);
+	page = BufferGetPage(buf);
+
+	bkpb = &record->blocks[block_id];
+	if(bkpb->bkp_image!=NULL)
+		ptr = bkpb->bkp_image;
+	else
+	{
+		elog(WARNING,
+				 "No page found in WAL for record %X/%X, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 (uint32) (record->ReadRecPtr>> 32), (uint32) record->ReadRecPtr ,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		return;
+	}
+
+	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
+	{
+		/* If a backup block image is compressed, decompress it */
+		if (pglz_decompress(ptr, bkpb->bimg_len, tmp,
+							BLCKSZ - bkpb->hole_length) < 0)
+		{
+			elog(ERROR, "invalid compressed image at %X/%X, block %d",
+								  (uint32) (record->ReadRecPtr >> 32),
+								  (uint32) record->ReadRecPtr,
+								  block_id);
+		}
+		ptr = tmp;
+	}
+	/*
+	 * If block restores the associated page during WAL replay,
+	 * adjust the block hole accordingly.
+	 */
+	if (bkpb->hole_length == 0)
+	{
+		memcpy(tmp, ptr, BLCKSZ);
+	}
+	else
+	{
+		memcpy(tmp, ptr, bkpb->hole_offset);
+		/* must zero-fill the hole */
+		MemSet(tmp + bkpb->hole_offset, 0, bkpb->hole_length);
+		memcpy(tmp + (bkpb->hole_offset + bkpb->hole_length),
+			ptr + bkpb->hole_offset,
+			BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+	}
+	ptr = tmp;
+	char *norm_new_page, *norm_old_page;
+	char	old_buf[BLCKSZ * 2];
+	char	new_buf[BLCKSZ * 2];
+	int		j = 0;
+	int		i;
+	bool	inconsistent = false;
+
+	/* Mask pages */
+	norm_new_page = mask_page((Page)ptr, blkno);
+	norm_old_page = mask_page((Page)page, blkno);
+	/*
+	 * Convert the pages to be compared into hex format to facilitate
+	 * their comparison and make potential diffs more readable while
+	 * debugging.
+	 */
+	for (i = 0; i < BLCKSZ; i++)
+	{
+		const char *digits = "0123456789ABCDEF";
+		uint8 byte_new = (uint8) norm_new_page[i];
+		uint8 byte_old = (uint8) norm_old_page[i];
+
+		new_buf[j] = digits[byte_new >> 4];
+		old_buf[j] = digits[byte_old >> 4];
+		/*
+		 * Do an inclusive comparison, if the new buffer has a mask
+		 * marker and not the old buffer pages are inconsistent as this
+		 * would mean that the old page has content that the new buffer
+		 * has not.
+		 */
+		if (new_buf[j]!=old_buf[j])
+		{
+			inconsistent = true;
+			break;
+		}
+		j++;
+		new_buf[j] = digits[byte_new & 0x0F];
+		old_buf[j] = digits[byte_old & 0x0F];
+		if (new_buf[j]!=old_buf[j])
+		{
+			inconsistent = true;
+			break;
+		}
+		j++;
+	}
+
+	/* Time to compare the old and new contents */
+	if (inconsistent)
+		elog(WARNING,
+			 "Inconsistent page (at byte %u) found for record %X/%X, rel %u/%u/%u, "
+			 "forknum %u, blkno %u", i,
+			 (uint32) (record->ReadRecPtr>> 32), (uint32) record->ReadRecPtr ,
+			 rnode.spcNode, rnode.dbNode, rnode.relNode,
+			 forknum, blkno);
+	else
+		elog(DEBUG1,
+			 "Consistent page found for record %X/%X, rel %u/%u/%u, "
+			 "forknum %u, blkno %u",
+			 (uint32) (record->ReadRecPtr  >> 32), (uint32) record->ReadRecPtr ,
+			 rnode.spcNode, rnode.dbNode, rnode.relNode,
+			 forknum, blkno);
+
+	pfree(norm_new_page);
+	pfree(norm_old_page);
+	ReleaseBuffer(buf);
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c37003a..af4df2a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -513,7 +513,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
-
+		int		enableWALConsistencyMask = 1;
+		RmgrIds		rmids[] = {RM_HEAP2_ID,RM_HEAP_ID,RM_BTREE_ID,RM_HASH_ID,RM_GIN_ID,RM_GIST_ID,RM_SEQ_ID,RM_SPGIST_ID,RM_BRIN_ID};
+		int		size = sizeof(rmids)/sizeof(rmids[0]);
+		int		i;
+		bool		needs_image_backup; /*Since, we always set needs_backup to true,
+							this field remembers the original value of needs_backup*/
 		if (!regbuf->in_use)
 			continue;
 
@@ -556,6 +561,24 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
+		/*
+		 * If wal consistency check is enabled for current rmid,
+		 * We do fpw for the current block.
+		 */
+		needs_image_backup = needs_backup;
+		for (i = 0; i < size; i++)
+		{
+			if (rmids[i]==rmid && (wal_consistency_mask & enableWALConsistencyMask))
+			{
+				needs_backup = true;
+				break;
+			}
+			/*
+			 * Enable checking for the next bit
+			 */
+			enableWALConsistencyMask <<= 1;
+		}
+
 		if (needs_backup)
 		{
 			Page		page = regbuf->page;
@@ -618,6 +641,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			if (needs_image_backup)
+				bimg.bimg_info |= BKPIMAGE_IS_REQUIRED;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index dcf747c..5e53df3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1077,11 +1077,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			}
 			datatotal += blk->data_len;
 
+			blk->require_image=false;
 			if (blk->has_image)
 			{
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+				/*
+				 * If we require the image for any other purpose than wal consistency check
+				 * set require_image flag.
+				 */
+				if(blk->bimg_info & BKPIMAGE_IS_REQUIRED)
+					blk->require_image = true;
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1222,6 +1229,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
 		}
+		/*
+		 * If image is inserted in the WAL record for any other purpose than WAL
+		 * consistency check, set has_image=true, else set it to false.
+		 */
+		blk->has_image=blk->require_image;
 	}
 
 	/* and finally, the main data */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index c98f981..eaf2d8b 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -49,16 +49,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -329,7 +319,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -339,9 +329,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1109,18 +1099,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1585,7 +1575,7 @@ seq_redo(XLogReaderState *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	if (info != XLOG_SEQ_LOG)
 		elog(PANIC, "seq_redo: unknown op code %u", info);
@@ -1604,9 +1594,9 @@ seq_redo(XLogReaderState *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec);
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..d42e3f7
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,415 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/brin_page.h"
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/hash.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufmask.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+static void mask_unused_space(Page page);
+static void mask_page_lsn(Page page);
+static void mask_heap_page(Page page, BlockNumber blkno);
+static void mask_spgist_page(Page page);
+static void mask_gist_page(Page page);
+static void mask_gin_page(Page page, BlockNumber blkno);
+static void mask_sequence_page(Page page);
+static void mask_btree_page(Page page);
+static void mask_hash_page(Page page);
+static void mask_brin_page(Page page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int	pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X\n",
+			 ((PageHeader) page)->pd_lsn.xlogid,
+			 ((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+/*
+ * Mask Page LSN
+ */
+static void
+mask_page_lsn(Page page)
+{
+
+	PageHeader phdr = (PageHeader) page;
+	PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF);
+}
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(Page page, BlockNumber blkno)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, I set it
+			 * to current block number and offset. Need suggestions!
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+			{
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+			}
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(Page page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(Page page)
+{
+	mask_unused_space(page);
+
+	/*Mask NSN*/
+	GistPageSetNSN(page, 0xFFFFFFFFFFFFFFFF);
+
+	/* Mask flag bits of a gist page*/
+	GistPageSetDeleted(page);
+	GistMarkTuplesDeleted(page);
+	GistMarkPageHasGarbage(page);
+	GistMarkFollowRight(page);
+}
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(Page page, BlockNumber blkno)
+{
+	GinPageOpaque opaque = GinPageGetOpaque(page);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+	{
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page);
+	}
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(Page page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(Page page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+			(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HALF_DEAD, BTP_SPLIT_END,
+	 * BTP_HAS_GARBAGE, BTP_INCOMPLETE_SPLIT flags
+	 */
+	maskopaq->btpo_flags |= 0xF0;
+}
+
+static void
+mask_hash_page(Page page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+	/*
+	 * Mask everything on a UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(HashPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE)==0)
+	{
+		/*
+		 * For pages other than metapage,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+}
+
+/*
+ * Mask a BRIN page
+ */
+static void
+mask_brin_page(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	mask_unused_space(page);
+}
+
+/*
+ * mask_page
+ *
+ * Mask a given page. First determine what kind of page it is, then
+ * normalize it. The normalized copy is palloc'ed, so the caller is
+ * responsible for freeing it with pfree() once the comparison is done.
+ * blkno is needed for gin pages, as their metapage does not use
+ * pd_lower and pd_upper.
+ * The caller is assumed to already hold a suitable lock on the page
+ * being masked.
+ */
+char *
+mask_page(const char *page, BlockNumber blkno)
+{
+	Page	page_norm;
+	uint16	tail;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		/* Case of a normal relation, it has an empty special area */
+		mask_heap_page(page_norm, blkno);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+			 tail == GIST_PAGE_ID)
+	{
+		/* Gist page */
+		mask_gist_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+			 tail <= MAX_BT_CYCLE_ID)
+	{
+		/* btree page */
+		mask_btree_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+			 tail == SPGIST_PAGE_ID)
+	{
+		/* SpGist page */
+		mask_spgist_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)) &&
+			 tail == HASHO_PAGE_ID)
+	{
+		mask_hash_page(page_norm);
+	}
+	else if (BRIN_IS_META_PAGE(page) || BRIN_IS_REVMAP_PAGE(page) || BRIN_IS_REGULAR_PAGE(page))
+	{
+		mask_brin_page(page_norm);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+	{
+		/*
+		 * The page found here is used either for a Gin index or a sequence.
+		 * Gin index pages do not have a proper identifier, so check whether the
+		 * page is used by a sequence. If it is not, the page belongs to a gin
+		 * index. It is still possible that a gin page fills this area with
+		 * exactly the same value as SEQ_MAGIC, but this is unlikely to happen.
+		 */
+		if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+			mask_sequence_page(page_norm);
+		else
+			mask_gin_page(page_norm, blkno);
+	}
+	else
+	{
+		/* Should not get here */
+		Assert(0);
+	}
+
+	/* Return normalized page */
+	return (char *) page_norm;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6ac5184..645a807 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1800,6 +1800,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"wal_consistency_mask", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Bitmask of resource managers for which WAL consistency checks are enabled."),
+			NULL
+		},
+		&wal_consistency_mask,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d0666c..e7e21ed 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,17 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency_mask = 0		# enable WAL consistency check for different operations
+					# bit 0 - HEAP2
+					# bit 1 - HEAP
+					# bit 2 - BTREE
+					# bit 3 - HASH
+					# bit 4 - GIN
+					# bit 5 - GIST
+					# bit 6 - SEQ
+					# bit 7 - SPGIST
+					# bit 8 - BRIN
+					# Multiple bits can also be enabled. For example, to enable HEAP and HASH, set the value to 10
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 14b7f7f..1fc5f6e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -104,6 +104,7 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern int wal_consistency_mask;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..287143b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,8 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		require_image;	/* true if the image is needed for redo; has_image
+								 * is always true when WAL consistency checking is on */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..34e28c0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,7 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
-
+#define BKPIMAGE_IS_REQUIRED		0x04	/* page is required by the WAL record */
 /*
  * Extra header information used when page image has "hole" and
  * is compressed.
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 6af60d8..a7a0e16 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -20,6 +20,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is a convenience that makes it possible to identify
+ * whether a page is being used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..1dd5a67
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+/* Entry point for page masking */
+extern char *mask_page(const char *page, BlockNumber blkno);
+
+#endif
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6621a39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -44,8 +44,9 @@ typedef enum
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
-	RBM_NORMAL_NO_LOG			/* Don't log page as invalid during WAL
+	RBM_NORMAL_NO_LOG,			/* Don't log page as invalid during WAL
 								 * replay; otherwise same as RBM_NORMAL */
+	RBM_WAL_CHECK		/* Normal read, but don't check whether the page is new or not */
 } ReadBufferMode;
 
 /* forward declared, to avoid having to expose buf_internals.h here */
#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kuntal Ghosh (#10)
Re: WAL consistency check facility

Kuntal Ghosh wrote:

4. For Speculative Heap tuple insert operation, there was
inconsistency in t_ctid value. So, I've modified the t_ctid value (in
backup page) to current block number and offset number. Need
suggestions!!

In speculative insertions, t_ctid is used to store the speculative
token. I think you should just mask that field out in that case (which
you can recognize because ip_posid is set to magic value 0xfffe).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Alvaro Herrera (#11)
Re: WAL consistency check facility

Thanks a lot.

I just want to mention the situation where I was getting the
speculative token related inconsistency.

ItemPointer in backup page from master:
LOG: ItemPointer BlockNumber: 1 OffsetNumber:65534 Speculative: true
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

ItemPointer in current page from slave after redo:
LOG: ItemPointer BlockNumber: 0 OffsetNumber:1 Speculative: false
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

As the block numbers are different, I was getting the following warning:
WARNING: Inconsistent page (at byte 8166) found for record
0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
I think this is why I was getting the above warning.
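The "at byte 8166" part of the warning suggests a byte-wise comparison of the two page images. A minimal sketch of such a comparison, assuming both images have already been masked; demo_first_mismatch and DEMO_BLCKSZ are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch (the default BLCKSZ is 8192). */
#define DEMO_BLCKSZ 8192

/*
 * Compare two page images byte by byte. Returns the offset of the first
 * differing byte, or -1 if the images are identical over len bytes.
 */
long
demo_first_mismatch(const unsigned char *backup,
					const unsigned char *current, size_t len)
{
	size_t		i;

	assert(backup != NULL && current != NULL);
	for (i = 0; i < len; i++)
	{
		if (backup[i] != current[i])
			return (long) i;
	}
	return -1;
}
```

Reporting the offset rather than a plain yes/no makes it easier to map a mismatch back to a tuple or header field, as with the t_ctid case above.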

On Thu, Aug 25, 2016 at 10:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Kuntal Ghosh wrote:

4. For Speculative Heap tuple insert operation, there was
inconsistency in t_ctid value. So, I've modified the t_ctid value (in
backup page) to current block number and offset number. Need
suggestions!!

In speculative insertions, t_ctid is used to store the speculative
token. I think you should just mask that field out in that case (which
you can recognize because ip_posid is set to magic value 0xfffe).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Thanks & Regards,
Kuntal Ghosh

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kuntal Ghosh (#12)
Re: WAL consistency check facility

Kuntal Ghosh wrote:

Thanks a lot.

I just want to mention the situation where I was getting the
speculative token related inconsistency.

ItemPointer in backup page from master:
LOG: ItemPointer BlockNumber: 1 OffsetNumber:65534 Speculative: true
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

ItemPointer in current page from slave after redo:
LOG: ItemPointer BlockNumber: 0 OffsetNumber:1 Speculative: false
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

As the block numbers are different, I was getting the following warning:
WARNING: Inconsistent page (at byte 8166) found for record
0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
I think this is why I was getting the above warning.

Umm, really? Then perhaps this *is* a bug. Peter?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Simon Riggs
simon@2ndquadrant.com
In reply to: Kuntal Ghosh (#10)
Re: WAL consistency check facility

Hi Kuntal,

Thanks for the patch.

The current patch has no docs, no tests and no explanatory comments,
which makes review quite hard.

The good news is you might discover a few bugs with it, so it's worth
pursuing actively in this CF, though it's not near to being
committable.

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.

On 25 August 2016 at 18:41, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

What needs to be done:
1. Add support for other Resource Managers.

We probably need to have a discussion as to why you think this should
be Rmgr dependent?
Code comments would help there.

If it does, then you should probably do this by extending RmgrTable
with an rm_check, so you can call it like this...

RmgrTable[record->xl_rmid].rm_check
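A minimal sketch of what such a per-rmgr callback could look like, using a toy table rather than the real RmgrTable. All Demo* names and the struct layout are hypothetical; the only thing taken from the thread is the idea of an optional rm_check entry invoked after rm_redo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy WAL record: just enough state for the sketch. */
typedef struct DemoRecord
{
	uint8_t		xl_rmid;		/* index into the resource manager table */
	int			redo_done;		/* set by the redo callback */
} DemoRecord;

/* Toy resource manager entry with an optional consistency check. */
typedef struct DemoRmgr
{
	const char *rm_name;
	void		(*rm_redo) (DemoRecord *record);
	bool		(*rm_check) (const DemoRecord *record);	/* NULL: no check */
} DemoRmgr;

static void
demo_redo(DemoRecord *record)
{
	record->redo_done = 1;
}

static bool
demo_heap_check(const DemoRecord *record)
{
	/* A real check would compare masked page images here. */
	return record->redo_done == 1;
}

const DemoRmgr DemoRmgrTable[] = {
	{"XLOG", demo_redo, NULL},
	{"Heap", demo_redo, demo_heap_check},
};

#define DEMO_RM_COUNT (sizeof(DemoRmgrTable) / sizeof(DemoRmgrTable[0]))

/* Replay one record, then run the rmgr's check if it provides one. */
bool
demo_replay_record(DemoRecord *record)
{
	const DemoRmgr *rmgr;

	assert(record != NULL && record->xl_rmid < DEMO_RM_COUNT);
	rmgr = &DemoRmgrTable[record->xl_rmid];
	rmgr->rm_redo(record);
	if (rmgr->rm_check == NULL)
		return true;
	return rmgr->rm_check(record);
}
```

Leaving rm_check as NULL for resource managers that do not modify pages keeps the dispatch uniform without forcing every rmgr to implement a check.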

I'm interested in how we handle the new generic WAL format for blocks.
Surely if we can handle that then we won't need an Rmgr dependency?
I'm sure you have reasons, they just need to be explained long hand -
don't assume anything.

5. Generalize the page type identification technique.

Why not do this first?

There is some coding guideline stuff to check as well.

Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#14)
Re: WAL consistency check facility

On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.

The actual consistency checks are done during redo (replay), so I'm not
sure what you have in mind for enabling it with check or installcheck. I
think we can run a few recovery/replay tests with this framework. Also,
running the tests under this framework could be time-consuming, as it
logs the entire page for each WAL record we write and then during
replay reads the same.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Amit Kapila (#15)
Re: WAL consistency check facility

On 27 August 2016 at 07:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.

The actual consistency checks are done during redo (replay), so I'm not
sure what you have in mind for enabling it with check or installcheck. I
think we can run a few recovery/replay tests with this framework. Also,
running the tests under this framework could be time-consuming, as it
logs the entire page for each WAL record we write and then during
replay reads the same.

I'd like to see an automated test added so we can be certain we don't
add things that break recovery. Don't mind much where or how.

The main use is to maintain that certainty while in production.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Simon Riggs (#14)
Re: WAL consistency check facility

Hello Simon,

I'm really sorry for the inconvenience. Next time, I'll attach the
patch with proper documentation, tests and comments.

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.

Since this is a redo (replay) feature, we can surely add this in
installcheck. But, as Amit mentioned, it could be time-consuming.

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations (rmgr list) for which we need this feature to be enabled.
Since this could be confusing, I've changed it so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
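For reference, a small sketch of how the integer form is decoded, using the bit positions listed in the postgresql.conf.sample hunk of the patch (bit 0 = HEAP2 through bit 8 = BRIN). The DEMO_* names and helper functions are hypothetical. With nine bits, 511 enables every listed resource manager, and 10 enables HEAP and HASH only.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions as listed in the patch's postgresql.conf.sample. */
enum
{
	DEMO_BIT_HEAP2 = 0,
	DEMO_BIT_HEAP,
	DEMO_BIT_BTREE,
	DEMO_BIT_HASH,
	DEMO_BIT_GIN,
	DEMO_BIT_GIST,
	DEMO_BIT_SEQ,
	DEMO_BIT_SPGIST,
	DEMO_BIT_BRIN,
	DEMO_BIT_COUNT				/* number of bits, not a bit itself */
};

/* True if the given bit is set in the mask. */
bool
demo_consistency_enabled(int mask, int bit)
{
	assert(bit >= 0 && bit < DEMO_BIT_COUNT);
	return (mask & (1 << bit)) != 0;
}

/* Mask value with every listed resource manager enabled. */
int
demo_mask_all(void)
{
	return (1 << DEMO_BIT_COUNT) - 1;
}
```

The need to do this arithmetic by hand is the confusion mentioned above, which a list of names avoids.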

1. Add support for other Resource Managers.

We probably need to have a discussion as to why you think this should
be Rmgr dependent?
Code comments would help there.

If it does, then you should probably do this by extending RmgrTable
with an rm_check, so you can call it like this...

RmgrTable[record->xl_rmid].rm_check

+1.
I'm modifying it accordingly. I'm calling this function after
RmgrTable[record->xl_rmid].rm_redo.

5. Generalize the page type identification technique.

Why not do this first?

At present, I'm using special page size and page ID to identify page
type. But, I've noticed some cases where the entire page is
initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
can help us to identify those pages.
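A rough sketch of the identification problem described above: classification by special-area size and trailing page ID works only when the page is not all zeroes. The Demo* names are hypothetical, and the hash page ID (0xFF80) and 16-byte special size are assumptions made for illustration rather than values taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192
/* Assumed values, for illustration only. */
#define DEMO_HASHO_PAGE_ID 0xFF80
#define DEMO_HASH_SPECIAL_SIZE 16

typedef enum
{
	DEMO_PAGE_ZEROED,			/* nothing to inspect, type unknown */
	DEMO_PAGE_HEAP,
	DEMO_PAGE_HASH,
	DEMO_PAGE_UNKNOWN
} DemoPageKind;

/* True if every byte of the page is zero. */
bool
demo_page_is_zeroed(const unsigned char *page, size_t len)
{
	size_t		i;

	for (i = 0; i < len; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/* Classify a page from its special-area size and last two bytes. */
DemoPageKind
demo_classify(const unsigned char *page, size_t special_size)
{
	uint16_t	tail;

	assert(page != NULL);
	if (demo_page_is_zeroed(page, DEMO_BLCKSZ))
		return DEMO_PAGE_ZEROED;
	if (special_size == 0)
		return DEMO_PAGE_HEAP;
	memcpy(&tail, &page[DEMO_BLCKSZ - 2], sizeof(tail));
	if (special_size == DEMO_HASH_SPECIAL_SIZE && tail == DEMO_HASHO_PAGE_ID)
		return DEMO_PAGE_HASH;
	return DEMO_PAGE_UNKNOWN;
}
```

For the zeroed case the page itself carries no type information, which is why the record's RmgrID and info bits are proposed as the fallback.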

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#16)
Re: WAL consistency check facility

On Sat, Aug 27, 2016 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 07:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.

The actual consistency checks are done during redo (replay), so I'm not
sure what you have in mind for enabling it with check or installcheck. I
think we can run a few recovery/replay tests with this framework. Also,
running the tests under this framework could be time-consuming, as it
logs the entire page for each WAL record we write and then during
replay reads the same.

I'd like to see an automated test added so we can be certain we don't
add things that break recovery. Don't mind much where or how.

The main use is to maintain that certainty while in production.

For developers, having extra checks with the new routines in WAL_DEBUG
could also be useful for a code path producing WAL. Let's not forget
that as well.
--
Michael

#19Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#13)
Re: WAL consistency check facility

On Fri, Aug 26, 2016 at 7:24 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

As the block numbers are different, I was getting the following warning:
WARNING: Inconsistent page (at byte 8166) found for record
0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
CONTEXT: xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
I think this is why I was getting the above warning.

Umm, really? Then perhaps this *is* a bug. Peter?

It's a matter of perspective, but probably not. The facts are:

* heap_insert() treats speculative insertions differently. In
particular, it does not set ctid in the caller-passed heap tuple
itself, leaving the ctid field to contain a speculative token value --
a per-backend monotonically increasing identifier. This identifier
represents some particular speculative insertion attempt within a
backend.

* On the redo side, heap_xlog_insert() was only changed mechanically
when upsert went in. So, it doesn't actually care about the stuff that
heap_insert() was made to care about to support speculative insertion.

* Furthermore, heap_insert() does *not* WAL log ctid under any
circumstances (that didn't change, either). Traditionally, the ctid
field was completely redundant anyway (since, of course, we're only
dealing with newly inserted tuples in heap_insert()). With speculative
insertions, there is a token within ctid, whose value represents
actual information that cannot be reconstructed after the fact (the
speculative token value). Even still, that isn't WAL-logged (see
comments above xl_heap_header struct). That's okay, because the
speculative insertion token value is only needed due to obscure issues
with "unprincipled deadlocks". The speculative token value itself is
only of interest to other, conflicting inserters, and only for the
instant in time immediately following physical insertion. The token
doesn't matter in the slightest to crash recovery, nor to Hot Standby
replicas.

While this design had some really nice properties (ask me if you are
unclear on this), it does break tools like the proposed WAL-checker
tool. I would compare this speculative token situation to the
situation with hint bits (when checksums are disabled, and
wal_log_hints = off).

I actually have a lot of sympathy for the idea that, in general, cases
like this should be avoided. But, would it really be worth it to
create a useless special case for speculative insertion (to WAL-log
the virtually useless speculative insertion token value)? I'm certain
that the answer must be "no": This tool ought to deal with speculative
insertion as a special case, and not vice-versa.

--
Peter Geoghegan

#20Peter Geoghegan
pg@heroku.com
In reply to: Kuntal Ghosh (#10)
Re: WAL consistency check facility

On Thu, Aug 25, 2016 at 9:41 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.

Why? I think that you should only perform this kind of masking where
it's clearly strictly necessary.

It is true that nbtree can allow cases where LP_DEAD is set with only
a share lock (by read-only queries), so I can see why BTP_HAS_GARBAGE
might need to be excluded; this is comparable to the heapam's use of
hint bits. However, it is unclear why you need to mask the remaining
btpo_flags that you list, because the other flags have clear-cut roles
in various atomic operations that we WAL-log.

--
Peter Geoghegan

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#20)
Re: WAL consistency check facility

On Sun, Aug 28, 2016 at 6:26 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 25, 2016 at 9:41 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.

Why? I think that you should only perform this kind of masking where
it's clearly strictly necessary.

It is true that nbtree can allow cases where LP_DEAD is set with only
a share lock (by read-only queries), so I can see why BTP_HAS_GARBAGE
might need to be excluded; this is comparable to the heapam's use of
hint bits. However, it is unclear why you need to mask the remaining
btpo_flags that you list, because the other flags have clear-cut roles
in various atomic operations that we WAL-log.

Right, I think there is no need to mask all the flags. However, apart
from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END, as that is
just used to save some processing for vacuum and won't be set after
crash recovery or on standby after WAL replay.
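A minimal sketch of the narrower masking discussed here: only BTP_SPLIT_END and BTP_HAS_GARBAGE are forced to a fixed state, so a difference in BTP_INCOMPLETE_SPLIT (or BTP_HALF_DEAD) would still be reported. The DEMO_ constants are stand-ins whose bit positions are assumed to mirror nbtree.h (consistent with the patch's 0xF0 covering bits 4 to 7); the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for btpo_flags bits; positions assumed to mirror nbtree.h. */
#define DEMO_BTP_LEAF				(1 << 0)
#define DEMO_BTP_SPLIT_END			(1 << 5)
#define DEMO_BTP_HAS_GARBAGE		(1 << 6)
#define DEMO_BTP_INCOMPLETE_SPLIT	(1 << 7)

/*
 * Force the two hint-like flags on, in the same style as the patch
 * (which sets masked bits rather than clearing them). All other flags
 * pass through unchanged and therefore still take part in comparison.
 */
uint16_t
demo_mask_btpo_flags(uint16_t flags)
{
	return (uint16_t) (flags | DEMO_BTP_SPLIT_END | DEMO_BTP_HAS_GARBAGE);
}
```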

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#22Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#21)
Re: WAL consistency check facility

On Sat, Aug 27, 2016 at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think there is no need to mask all the flags. However, apart
from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END, as that is
just used to save some processing for vacuum and won't be set after
crash recovery or on standby after WAL replay.

Right you are -- while BTP_INCOMPLETE_SPLIT is set during recovery,
BTP_SPLIT_END is not. Still, most of the btpo_flags flags that are
masked in the patch shouldn't be.

--
Peter Geoghegan

#23Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Peter Geoghegan (#22)
Re: WAL consistency check facility

Thank you. I've updated it accordingly.

On Sun, Aug 28, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Sat, Aug 27, 2016 at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think there is no need to mask all the flags. However, apart
from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END, as that is
just used to save some processing for vacuum and won't be set after
crash recovery or on standby after WAL replay.

Right you are -- while BTP_INCOMPLETE_SPLIT is set during recovery,
BTP_SPLIT_END is not. Still, most of the btpo_flags flags that are
masked in the patch shouldn't be.

--
Peter Geoghegan

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#24Simon Riggs
simon@2ndquadrant.com
In reply to: Kuntal Ghosh (#17)
Re: WAL consistency check facility

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations (rmgr list) for which we need this feature to be enabled.
Since this could be confusing, I've changed it so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

1. Add support for other Resource Managers.

We probably need to have a discussion as to why you think this should
be Rmgr dependent?
Code comments would help there.

If it does, then you should probably do this by extending RmgrTable
with an rm_check, so you can call it like this...

RmgrTable[record->xl_rmid].rm_check

+1.
I'm modifying it accordingly. I'm calling this function after
RmgrTable[record->xl_rmid].rm_redo.

5. Generalize the page type identification technique.

Why not do this first?

At present, I'm using special page size and page ID to identify page
type. But, I've noticed some cases where the entire page is
initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
can help us to identify those pages.

I'd prefer a solution that was not dependent upon RmgrID at all.

If there are various special cases that we need to cater for, ISTM
they would be flaws in the existing WAL implementation rather than
anything we would want to perpetuate. I hope we'll spend time fixing
them rather than add loads of weird code to work around the
imperfections.

Underdocumented special case code is going to be unbelievably
difficult to get right in the long term.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#24)
Re: WAL consistency check facility

On Wed, Aug 31, 2016 at 10:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations (rmgr list) for which we need this feature to be enabled.
Since this could be confusing, I've changed it so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

I am still in favor of just an on/off switch instead of this
complication. An all-or-nothing feature is what we are looking at here.
Still, a list is an improvement compared to a bitmap.

1. Add support for other Resource Managers.

We probably need to have a discussion as to why you think this should
be Rmgr dependent?
Code comments would help there.

If it does, then you should probably do this by extending RmgrTable
with an rm_check, so you can call it like this...

RmgrTable[record->xl_rmid].rm_check

+1.
I'm modifying it accordingly. I'm calling this function after
RmgrTable[record->xl_rmid].rm_redo.

5. Generalize the page type identification technique.

Why not do this first?

At present, I'm using special page size and page ID to identify page
type. But, I've noticed some cases where the entire page is
initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
can help us to identify those pages.

I'd prefer a solution that was not dependent upon RmgrID at all.

So you'd rather identify the page types by looking at pd_special? That
seems worse to me but..
--
Michael

#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#24)
Re: WAL consistency check facility

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations (rmgr list) for which we need this feature to be enabled.
Since this could be confusing, I've changed it so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs the WAL facility or
add WAL capability to an existing system (say Hash Index), we can
just test that module, rather than the whole system. I think it can help
us in narrowing down the problem, if we have a facility to enable it at
the RMGR ID level. Having said that, I think this must have the facility
to enable it for all the RMGR IDs (say ALL) and probably that should
be the default.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#27Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#26)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations(rmgr list) for which we need this feature to be enabled.
Since, this could be confusing, I've changed it accordingly so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs WAL facility or
adding WAL capability to an existing system (say Hash Index), we can
just test that module, rather than whole system. I think it can help
us in narrowing down the problem, if we have facility to enable it at
RMGR ID level. Having said that, I think this must have the facility
to enable it for all the RMGR ID's (say ALL) and probably that should
be default.

As far as I am understanding things, we are aiming at something that
could be used on production systems. And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer. If we are designing something for testing purposes
instead, something is wrong with this patch then.

Doing filtering at RMGR level for testing and development purposes
will be done by somebody who has the skills to filter out which
records he should look at. Or he'll bump into an existing bump. So I'd
rather keep this thing simple.
--
Michael

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#27)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 8:30 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations(rmgr list) for which we need this feature to be enabled.
Since, this could be confusing, I've changed it accordingly so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs WAL facility or
adding WAL capability to an existing system (say Hash Index), we can
just test that module, rather than whole system. I think it can help
us in narrowing down the problem, if we have facility to enable it at
RMGR ID level. Having said that, I think this must have the facility
to enable it for all the RMGR ID's (say ALL) and probably that should
be default.

As far as I am understanding things, we are aiming at something that
could be used on production systems.

I don't think you can enable it by default in production systems.
Enabling it will lead to significant performance drop as it writes the
whole page after each record for most types of RMGR IDs.

And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer.

Agreed, but remember enabling it for all is not free.

If we are designing something for testing purposes
instead, something is wrong with this patch then.

What is wrong?

Doing filtering at RMGR level for testing and development purposes
will be done by somebody who has the skills to filter out which
records he should look at.

Right, but seen that way, many of our GUC parameters need a good level
of understanding to set correct values for them. For example, do you
think it is easy for a user to set a value for
"replacement_sort_tuples" without reading the description or
understanding what it means? This might not be the best example, but I
think there are other parameters that do require some deeper
understanding of the system. The main thing is that default values for
such parameters should be chosen carefully so that they represent the
most common usage.

Or he'll bump into an existing bump. So I'd
rather keep this thing simple.

It seems to me that having an option of 'ALL' would make it easier for
users to set it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#26)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 8:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations(rmgr list) for which we need this feature to be enabled.
Since, this could be confusing, I've changed it accordingly so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs WAL facility or
adding WAL capability to an existing system (say Hash Index), we can
just test that module, rather than whole system. I think it can help
us in narrowing down the problem, if we have facility to enable it at
RMGR ID level. Having said that, I think this must have the facility
to enable it for all the RMGR ID's (say ALL) and probably that should
be default.

oops, I think having an option of specifying 'ALL' is good, but that
shouldn't be default, because it could have serious performance
implications.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#30Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#28)
Re: WAL consistency check facility

On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I am understanding things, we are aiming at something that
could be used on production systems.

I don't think you can enable it by default in production systems.
Enabling it will lead to significant performance drop as it writes the
whole page after each record for most types of RMGR IDs.

And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer.

Agreed, but remember enabling it for all is not free.

I have sympathy for the idea that this should be as low overhead as
possible, even if that means adding complexity to the interface --
within reason. I would like to hear a practical example of where this
RMGR id interface could be put to good use, when starting with little
initial information about a problem. And, ideally, we'd also have some
indication of how big a difference that would make, in terms of
measurable performance impact.

--
Peter Geoghegan

#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#30)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I am understanding things, we are aiming at something that
could be used on production systems.

I don't think you can enable it by default in production systems.
Enabling it will lead to significant performance drop as it writes the
whole page after each record for most types of RMGR IDs.

And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer.

Agreed, but remember enabling it for all is not free.

I have sympathy for the idea that this should be as low overhead as
possible, even if that means adding complexity to the interface --
within reason. I would like to hear a practical example of where this
RMGR id interface could be put to good use, when starting with little
initial information about a problem.

One example that comes to mind is for the cases where the problem
reproduces only under high concurrency or some stress test. Now assume
the problem is with an index; enabling it for all rmgrs could reduce the
probability of hitting the problem because of its performance impact. The second
advantage which I have already listed is it helps in future
development like the one I am doing now for hash indexes (making them
logged).

And, ideally, we'd also have some
indication of how big a difference that would make, in terms of
measurable performance impact.

Yes, that's a valid point. I think we can do some tests to see the difference.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#32Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#24)
Re: WAL consistency check facility

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'd prefer a solution that was not dependent upon RmgrID at all.

If there are various special cases that we need to cater for, ISTM
they would be flaws in the existing WAL implementation rather than
anything we would want to perpetuate. I hope we'll spend time fixing
them rather than add loads of weird code to work around the
imperfections.

Underdocumented special case code is going to be unbelievably
difficult to get right in the long term.

It seems to me that you may be conflating the issue of which changes
should be masked out as hints (which is, indeed, special case code,
whether underdocumented or not) with the issue of which rmgrs the user
may want to verify (which is just a case of matching the rmgr ID in
the WAL record against a list provided by the user, and is not special
case code at all).
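
As an illustration of the masking side of this distinction: the patch posted later in this thread describes ignoring areas such as the unused space between pd_lower and pd_upper before comparing pages. A minimal sketch of that one masking step follows. The two-field header struct and the function name are simplified stand-ins invented for this example, not PostgreSQL's real page layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size for the sketch */

/* Simplified stand-in for a page header: only the two offsets used here. */
typedef struct SketchPageHeader
{
    uint16_t    lower;          /* offset where the used front area ends */
    uint16_t    upper;          /* offset where the used back area begins */
} SketchPageHeader;

/*
 * Zero the unused "hole" between lower and upper so that stale bytes in
 * it cannot cause a spurious mismatch. Returns 0 on success, or -1 if
 * the header offsets are not sane.
 */
static int
mask_unused_space(unsigned char *page)
{
    SketchPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    if (hdr.lower < sizeof(hdr) ||
        hdr.lower > hdr.upper ||
        hdr.upper > SKETCH_BLCKSZ)
        return -1;

    memset(page + hdr.lower, 0, (size_t) (hdr.upper - hdr.lower));
    return 0;
}
```

Both pages would be masked the same way before the comparison. Deciding which other bytes count as hints is where page-type knowledge comes in, and that is the part that is special-case code.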

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#32)
Re: WAL consistency check facility

On 1 September 2016 at 11:16, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'd prefer a solution that was not dependent upon RmgrID at all.

If there are various special cases that we need to cater for, ISTM
they would be flaws in the existing WAL implementation rather than
anything we would want to perpetuate. I hope we'll spend time fixing
them rather than add loads of weird code to work around the
imperfections.

Underdocumented special case code is going to be unbelievably
difficult to get right in the long term.

It seems to me that you may be conflating the issue of which changes
should be masked out as hints (which is, indeed, special case code,
whether underdocumented or not) with the issue of which rmgrs the user
may want to verify (which is just a case of matching the rmgr ID in
the WAL record against a list provided by the user, and is not special
case code at all).

Yep, it seems entirely likely that I am misunderstanding what is
happening here. I'd like to see an analysis/discussion before we write
code. As you might expect, I'm excited by this feature and the
discoveries it appears likely to bring.

We've got wal_log_hints and that causes lots of extra traffic. I'm
happy with assuming that is switched on in this case also. (Perhaps we
might have a wal_log_level with various levels of logging.)

So in my current understanding, a hinted change has by definition no
WAL record, so we just ship a FPW. A non-hint change has a WAL record
and it is my (possibly naive) hope that all changes to a page are
reflected in the WAL record/replay, so we can just make a simple
comparison without caring what is the rmgr of the WAL record.

If we can start by discussing which special cases we know about that
require extra code, that will help. We can then decide whether to fix
the WAL record/replay or fix the comparison logic, possibly on a case
by case basis. My current preference is to generate lots of small
fixes to existing WAL code and then have a very, very simple patch for
this actual feature, but am willing to discuss alternate approaches.

IMV this would be a feature certain users would want turned on all the
time for everything. So I'm not bothered much about making this
feature settable by rmgr. I might question why this particular feature
would be settable by rmgr, when features like wal_log_hints and
wal_compression are not, but such discussion is a minor point in
comparison to discussing the main feature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#33)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

So in my current understanding, a hinted change has by definition no
WAL record, so we just ship a FPW.

Hmm. An FPW would have to be contained in a WAL record, so it can't
be right to say that we ship an FPW for lack of a WAL record. I think
what we ship is nothing at all when wal_log_hints is disabled, and
when wal_log_hints is enabled we log an FPW once per checkpoint.

A non-hint change has a WAL record
and it is my (possibly naive) hope that all changes to a page are
reflected in the WAL record/replay,

I hope for this, too.

so we can just make a simple
comparison without caring what is the rmgr of the WAL record.

Sure, that is 100% possible, and likely a good idea as far as the
behavior on the standby is concerned. What's not so clear is whether
a simple on/off switch is a wise plan on the master.

The purpose of this code, as I understand it, is to check for
discrepancies between "do" and "redo"; that is, to verify that the
changes made to the buffer at the time the WAL record is generated
produce the same result as replay of that WAL record on the standby.
To accomplish this purpose, a post-image of the affected buffers is
included in each and every WAL record. On replay, that post-image can
be compared with the result of replay. If they differ, PostgreSQL has
a bug. I would not expect many users to run this in production,
because it will presumably be wicked expensive. If I recall
correctly, doing full page writes once per buffer per checkpoint, the
space taken up by FPWs is >75% of WAL volume. Doing it for every
record will be exponentially more expensive. The primary audience of
this feature is PostgreSQL developers, who might want to use it to try
to verify that, for example, Amit's patch to add write-ahead logging
for hash indexes does not have bugs.[1]
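
The comparison step described above can be sketched as a byte-wise scan that reports where the post-image and the replayed page first diverge. This is only an illustration under assumed names: the function and the page-size macro are hypothetical, and it leaves out any masking of bytes that are allowed to differ.

```c
#include <stddef.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size for the sketch */

/*
 * Compare the post-image carried in the WAL record with the page that
 * results from replay. Returns the offset of the first differing byte,
 * or SKETCH_BLCKSZ if the two pages are identical.
 */
static size_t
first_mismatch(const unsigned char *post_image,
               const unsigned char *replayed)
{
    size_t      i;

    for (i = 0; i < SKETCH_BLCKSZ; i++)
    {
        if (post_image[i] != replayed[i])
            return i;
    }
    return SKETCH_BLCKSZ;
}
```

Returning an offset rather than a boolean lets the caller log where the pages diverge, which is usually the first thing needed when chasing a redo bug.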

Indeed, it had occurred to me that we might not even want to compile
this code into the server unless WAL_DEBUG is defined; after all, how
does it help a regular user to detect that the server has a bug? Bug
or no bug, that's the code they've got. But on further reflection, it
seems like it could be useful: if we suspect a bug in the redo code
but we can't reproduce it here, we could ask the customer to turn this
option on to see whether it produces logging indicating the nature of
the problem. However, because of the likely expense of enabling the
feature, it seems like it would be quite desirable to limit the
expense of generating many extra FPWs to the affected rmgr. For
example, if a user has a table with a btree index and a gin index, and
we suspect a bug in GIN, it would be nice for the user to be able to
enable the feature *only for GIN* rather than paying the cost of
enabling it for btree and heap as well.[2]

Similarly, when we imagine a developer using this feature to test for
bugs, it may at times be useful to enable it across-the-board to look
for bugs in any aspect of the write-ahead logging system. However, at
other times, when the goal is to find bugs in a particular AM, it
might be useful to enable it only for the corresponding rmgr. It is
altogether likely that this feature will slow the system down quite a
lot. If enabling this feature for hash indexes also means enabling it
for the heap, the incremental performance hit might be sufficient to
mask concurrency-related bugs in the hash index code that would
otherwise have been found. So, I think having at least some basic
filtering is probably a pretty smart idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: It probably has bugs.
[2]: One could of course add filtering considerably more complex than
per-rmgr - e.g. enabling it for only one particular relfilenode on a
busy production system might be rather desirable. But I'm not sure we
really want to go there. It adds a fair amount of complexity to a
feature that many people are obviously hoping will be quite simple to
use.

#35Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#34)
Re: WAL consistency check facility

On 1 September 2016 at 17:23, Robert Haas <robertmhaas@gmail.com> wrote:

The primary audience of this feature is PostgreSQL developers

I have spoken to users who are waiting for this feature to run in
production, which is why I suggested it.

Some people care more about correctness than they do about loss of performance.

Obviously, this would be expensive and those with a super high
performance requirement may not be able to take advantage of this. I'm
sure many people will turn it off once they hit a performance
issue, but running it in production for the first few months will give
people a very safe feeling.

I think the primary use for an rmgr filter might well be PostgreSQL developers.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#36Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#34)
Re: WAL consistency check facility

On Thu, Sep 1, 2016 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Indeed, it had occurred to me that we might not even want to compile
this code into the server unless WAL_DEBUG is defined; after all, how
does it help a regular user to detect that the server has a bug? Bug
or no bug, that's the code they've got. But on further reflection, it
seems like it could be useful: if we suspect a bug in the redo code
but we can't reproduce it here, we could ask the customer to turn this
option on to see whether it produces logging indicating the nature of
the problem. However, because of the likely expense of enabling the
feature, it seems like it would be quite desirable to limit the
expense of generating many extra FPWs to the affected rmgr. For
example, if a user has a table with a btree index and a gin index, and
we suspect a bug in GIN, it would be nice for the user to be able to
enable the feature *only for GIN* rather than paying the cost of
enabling it for btree and heap as well.[2]

Yes, that would be rather a large advantage.

I think that there really is no hard distinction between users and
hackers. Some people will want to run this in production, and it would
be a lot better if performance was at least not atrocious. If amcheck
couldn't do the majority of its verification with only an
AccessShareLock, then users probably just couldn't use it. Heroku
wouldn't have been able to use it on all production databases. It
wouldn't have mattered that the verification was no less effective,
since the bugs it found would simply never have been observed in
practice.

--
Peter Geoghegan

#37Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Peter Geoghegan (#36)
1 attachment(s)
Re: WAL consistency check facility

Hello,

As per the earlier discussions, I've attached the updated patch for
WAL consistency check feature. This is how the patch works:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.
- I've extended the RmgrTable with a new function pointer
rm_checkConsistency, which is called after rm_redo. (only when WAL
consistency check is enabled for this rmgrID)
- In each rm_checkConsistency, both backup pages and buffer pages are
masked accordingly before any comparison.
- In postgresql.conf, a new guc variable named 'wal_consistency' is
added. Default value of this variable is 'None'. Valid values are
combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
values.
- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.
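
The control flow described in the list above (redo first, then an optional per-rmgr consistency callback) can be sketched roughly as follows. Everything here is a toy model: the struct layouts, the callback signatures, the 16-byte page and the names are assumptions for illustration and do not match the real RmgrTable or XLogReaderState, and masking is omitted.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16     /* tiny page, to keep the example short */

/* Toy WAL record: one logged change plus the primary's post-image. */
typedef struct SketchRecord
{
    unsigned    rmid;           /* index into the rmgr table */
    unsigned char new_byte;     /* the logged change: new value of byte 0 */
    unsigned char backup_image[SKETCH_PAGE_SIZE];
} SketchRecord;

/* Toy rmgr entry with a redo callback and a consistency callback. */
typedef struct SketchRmgr
{
    const char *name;
    void        (*redo) (const SketchRecord *rec, unsigned char *page);
    /* returns 1 if the replayed page matches the backup image, else 0 */
    int         (*check_consistency) (const SketchRecord *rec,
                                      const unsigned char *page);
} SketchRmgr;

static void
toy_redo(const SketchRecord *rec, unsigned char *page)
{
    page[0] = rec->new_byte;
}

static int
toy_check(const SketchRecord *rec, const unsigned char *page)
{
    return memcmp(rec->backup_image, page, SKETCH_PAGE_SIZE) == 0;
}

static const SketchRmgr sketch_rmgr_table[] = {
    {"toy", toy_redo, toy_check},
};

/*
 * Replay one record, then run the consistency callback only if the
 * check is enabled for this rmgr. Returns 1 if the page is consistent
 * (or was not checked) and 0 if an inconsistency was found.
 */
static int
replay_record(const SketchRecord *rec, unsigned char *page,
              unsigned enabled_mask)
{
    const SketchRmgr *rmgr = &sketch_rmgr_table[rec->rmid];

    rmgr->redo(rec, page);
    if ((enabled_mask >> rec->rmid) & 1u)
        return rmgr->check_consistency(rec, page);
    return 1;
}
```

Keeping the check behind a per-rmgr bit means a record from an rmgr that is not being verified pays only for the redo call.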

Results
------------------------

I've tested with installcheck and installcheck-world in master-standby
set-up. The following are the configuration parameters.

Master:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 4000
hot_standby = on
wal_consistency = 'All'

Standby:
wal_consistency = 'All'

I got two types of inconsistencies, as follows:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In the backup
page, both the BTP_DELETED and BTP_LEAF flags are set, whereas after
redo, only the BTP_DELETED flag is set in the buffer page. I assume that we
should clear all btpo_flags before setting BTP_DELETED in
_bt_unlink_halfdead_page().

2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
REVMAP page. This happens only for two cases. I'm not sure what the
reason can be.
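
The first inconsistency can be pictured with a small sketch. It assumes that the "do" path ORs the deleted flag into the existing flags while the redo path assigns it outright, which is what the observed pages suggest. The flag values and function names below are illustrative placeholders, not copied from PostgreSQL's headers.

```c
#include <stdint.h>

/* Illustrative flag bits; the exact values are an assumption. */
#define SK_BTP_LEAF     (1u << 0)
#define SK_BTP_DELETED  (1u << 2)

/* Assumed "do" behaviour: keep existing bits and add DELETED. */
static uint16_t
do_path_flags(uint16_t flags)
{
    return (uint16_t) (flags | SK_BTP_DELETED);
}

/* Assumed "redo" behaviour: the page ends up with DELETED only. */
static uint16_t
redo_path_flags(void)
{
    return (uint16_t) SK_BTP_DELETED;
}

/* The suggested fix: clear all flags before setting DELETED. */
static uint16_t
do_path_flags_fixed(uint16_t flags)
{
    flags = 0;
    return (uint16_t) (flags | SK_BTP_DELETED);
}
```

Starting from a leaf page, the unfixed "do" path yields LEAF plus DELETED while redo yields DELETED alone, which is the mismatch the check reports. With the flags cleared first, both paths agree.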

I haven't done sufficient tests yet to measure the overhead of this
modification. I'll do that next.

Thanks to Amit Kapila, Dilip Kumar and Robert Haas for their off-line
suggestions.

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v6.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..4c63ded 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,7 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
 
 /*
  * xlog replay routines
@@ -286,3 +286,84 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+brin_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_brin_page(info, blkno, new_page);
+		norm_old_page = mask_brin_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..09760e0 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,84 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+gin_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask Pages */
+		norm_new_page = mask_gin_page(info, blkno, new_page);
+		norm_old_page = mask_gin_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 01c7ef7..5ba8ea0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -420,3 +421,86 @@ gistXLogUpdate(Buffer buffer,
 
 	return recptr;
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+gist_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_gist_page(info, blkno, new_page);
+		norm_old_page = mask_gist_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+		{
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u, block_id %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno, block_id);
+		}
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3e34c59 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -21,10 +21,12 @@
 #include "access/hash.h"
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
+#include "access/xlogutils.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
@@ -711,3 +713,84 @@ hash_redo(XLogReaderState *record)
 {
 	elog(PANIC, "hash_redo: unimplemented");
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+hash_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_hash_page(info, blkno, new_page);
+		norm_old_page = mask_hash_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..d56324b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,6 +58,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "storage/bufmask.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -9120,3 +9121,84 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+heap_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_heap_page(info, blkno, new_page);
+		norm_old_page = mask_heap_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..7425a47 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,88 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+btree_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	/* No redo for the following type */
+	if (info == XLOG_BTREE_UNLINK_PAGE)
+		return;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_btree_page(info, blkno, new_page);
+		norm_old_page = mask_btree_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..e972695 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,84 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+spg_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_spg_page(info, blkno, new_page);
+		norm_old_page = mask_spg_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 1926d98..ec55181 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -16,6 +16,7 @@
 #include "access/generic_xlog.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 /*-------------------------------------------------------------------------
@@ -533,3 +534,88 @@ generic_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffers[block_id]);
 	}
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+generic_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/*
+		 * At present, generic xlog is used only by the bloom index,
+		 * so its pages are masked as common pages. This can be
+		 * changed if required.
+		 */
+		norm_new_page = mask_common_page(info, blkno, new_page, true, true);
+		norm_old_page = mask_common_page(info, blkno, old_page, true, true);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..7e85c2b 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,12 +26,13 @@
 #include "commands/tablespace.h"
 #include "replication/message.h"
 #include "replication/origin.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,checkConsistency) \
+	{ name, redo, desc, identify, startup, cleanup, checkConsistency },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..5ad6228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -53,6 +54,8 @@
 #include "replication/walsender.h"
 #include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/bufmask.h"
+#include "storage/bufpage.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/large_object.h"
@@ -95,6 +98,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char		*wal_consistency_string = NULL;
+bool		*wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -6944,6 +6949,14 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages in the WAL record are
+				 * consistent with the current pages. The check is done only if
+				 * consistency checking is enabled for the corresponding rmid.
+				 */
+				if (wal_consistency[record->xl_rmid])
+					RmgrTable[record->xl_rmid].rm_checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11708,3 +11721,87 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+xlog_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	/* in XLOG rmgr, backup blocks are only used by XLOG_FPI records */
+	if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	{
+		old_page = (Page) palloc(BLCKSZ);
+		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		{
+			Buffer buf;
+			char *norm_new_page, *norm_old_page;
+
+			if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+			{
+				/* Caller specified a bogus block_id. Don't do anything. */
+				continue;
+			}
+			/*
+			 * Fetch the current page from the buffer. new_page points into
+			 * the buffer itself; no copy is made at this point.
+			 */
+			buf = XLogReadBufferExtended(rnode, forknum, blkno,
+											   RBM_NORMAL);
+			if (!BufferIsValid(buf))
+				continue;
+			new_page = BufferGetPage(buf);
+
+			/*
+			 * Read the contents from the backup copy, stored in WAL record
+			 * and store it in a temporary page. Before restoring, set
+			 * has_image value as true, since RestoreBlockImage checks
+			 * this flag. After restoring the image, restore the value of
+			 * has_image flag.
+			 */
+			has_image = record->blocks[block_id].has_image;
+			record->blocks[block_id].has_image = true;
+			if (!RestoreBlockImage(record, block_id, old_page))
+				elog(ERROR, "failed to restore block image");
+			record->blocks[block_id].has_image = has_image;
+
+			/* Mask pages */
+			norm_new_page = mask_common_page(info, blkno, new_page, false, false);
+			norm_old_page = mask_common_page(info, blkno, old_page, false, false);
+
+			/* Time to compare the old and new contents */
+			inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+			if (inconsistent_loc < BLCKSZ)
+				elog(WARNING,
+					 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+					 "forknum %u, blkno %u", inconsistent_loc,
+					 rnode.spcNode, rnode.dbNode, rnode.relNode,
+					 forknum, blkno);
+			else
+				elog(DEBUG1,
+					 "Consistent page found, rel %u/%u/%u, "
+					 "forknum %u, blkno %u",
+					 rnode.spcNode, rnode.dbNode, rnode.relNode,
+					 forknum, blkno);
+			pfree(norm_new_page);
+			pfree(norm_old_page);
+			ReleaseBuffer(buf);
+		}
+		pfree(old_page);
+	}
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..2f7c36b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -556,7 +556,11 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If WAL consistency checking is enabled for the current rmid,
+		 * always include a full-page image of the current block.
+		 */
+		if (needs_backup || wal_consistency[rmid])
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -608,7 +612,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 			 * Fill in the remaining fields in the XLogRecordBlockHeader
 			 * struct
 			 */
-			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+			/*
+			 * When WAL consistency checking is enabled for the current rmid,
+			 * a backup image is always included in the WAL record. Set
+			 * BKPBLOCK_HAS_IMAGE only if needs_backup is true: during redo,
+			 * this flag sets has_image in DecodedBkpBlock, and setting it
+			 * unnecessarily would cause the page to be restored during redo.
+			 */
+			if (needs_backup)
+				bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
 
 			/*
 			 * Construct XLogRecData entries for the page content.
@@ -680,7 +693,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (needs_backup || wal_consistency[rmid])
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..2f6b51e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1026,6 +1026,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
+	bool checkConsistency = false;
+
+#ifndef FRONTEND
+	/* Check whether WAL consistency checking is enabled for this rmid. */
+	checkConsistency = wal_consistency[record->xl_rmid];
+#endif
 
 	ResetDecoder(state);
 
@@ -1114,7 +1120,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			}
 			datatotal += blk->data_len;
 
-			if (blk->has_image)
+			/*
+			 * If WAL consistency checking is enabled, the block always
+			 * carries a backup image.
+			 */
+			if (blk->has_image || checkConsistency)
 			{
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
@@ -1242,7 +1252,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
-		if (blk->has_image)
+		/*
+		 * If WAL consistency checking is enabled, the block always
+		 * carries a backup image.
+		 */
+		if (blk->has_image || checkConsistency)
 		{
 			blk->bkp_image = ptr;
 			ptr += blk->bimg_len;
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index c98f981..a7349b2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -49,16 +50,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -329,7 +320,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -339,9 +330,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1109,18 +1100,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1585,7 +1576,7 @@ seq_redo(XLogReaderState *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	if (info != XLOG_SEQ_LOG)
 		elog(PANIC, "seq_redo: unknown op code %u", info);
@@ -1604,9 +1595,9 @@ seq_redo(XLogReaderState *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec);
@@ -1638,3 +1629,87 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Check whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before comparing the two pages, apply the
+ * appropriate masking to ignore areas such as hint bits and the unused
+ * space between pd_lower and pd_upper. For more information about
+ * masking, see the masking function.
+ * This function should be called after the WAL record has been replayed.
+ */
+void
+seq_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int block_id;
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blkno;
+	int inconsistent_loc;
+	bool has_image;
+	Page new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer buf;
+		char *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Fetch the current page from the buffer. new_page points into
+		 * the buffer itself; no copy is made at this point.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										   RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Since we always reinit the page in seq_redo, there is no need
+		 * to handle any special cases during masking. We can use the common
+		 * mask function to mask seq pages.
+		 */
+		norm_new_page = mask_common_page(info, blkno, new_page, true, true);
+		norm_old_page = mask_common_page(info, blkno, old_page, true, true);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %d) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..6b86379
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,468 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/brin_page.h"
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/hash.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufmask.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+static void mask_page_lsn(Page page);
+static void mask_page_hint_bits(Page page);
+static void mask_unused_space(Page page);
+
+/*
+ * Mask Page LSN
+ */
+static void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+	PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF);
+}
+
+/*
+ * Mask Page hint bits
+ */
+static void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int	pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+char *
+mask_heap_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, I set it
+			 * to current block number and offset. Need suggestions!
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+			{
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+			}
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a btree page
+ */
+char *
+mask_btree_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a hash page
+ */
+char *
+mask_hash_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm);
+	/*
+	 * Mask everything on an UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE) == 0)
+	{
+		/*
+		 * For pages other than the metapage,
+		 * mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a SpGist page
+ */
+char *
+mask_spg_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a GIST page
+ */
+char *
+mask_gist_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, 0xFFFFFFFFFFFFFFFF);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL record.
+	 * Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages,
+		 * mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In Gist redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a Gin page
+ */
+char *
+mask_gin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	GinPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a BRIN page
+ */
+char *
+mask_brin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* BRIN meta and revmap pages get no special masking yet; add it if needed. */
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a common page
+ */
+char *
+mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The page image is stored before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	if (maskHints)
+		mask_page_hint_bits(page_norm);
+
+	if (maskUnusedSpace)
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..cc35fc4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1134,3 +1134,47 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
 }
+
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly the same, it returns BLCKSZ. Otherwise,
+ * it returns the byte offset at which the first mismatch occurred.
+ */
+int
+comparePages(char *page1, char *page2)
+{
+	char	buf1[BLCKSZ * 2];
+	char	buf2[BLCKSZ * 2];
+	int		j = 0;
+	int		i;
+
+	/*
+	 * Convert the pages to be compared into hex format to facilitate
+	 * their comparison and make potential diffs more readable while
+	 * debugging.
+	 */
+	for (i = 0; i < BLCKSZ ; i++)
+	{
+		const char *digits = "0123456789ABCDEF";
+		uint8 byte1 = (uint8) page1[i];
+		uint8 byte2 = (uint8) page2[i];
+
+		buf1[j] = digits[byte1 >> 4];
+		buf2[j] = digits[byte2 >> 4];
+
+		if (buf1[j] != buf2[j])
+		{
+			break;
+		}
+		j++;
+
+		buf1[j] = digits[byte1 & 0x0F];
+		buf2[j] = digits[byte2 & 0x0F];
+		if (buf1[j] != buf2[j])
+		{
+			break;
+		}
+		j++;
+	}
+	return i;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..71baf0a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -144,6 +144,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3248,6 +3251,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Sets the rmgrIDs for which WAL consistency should be checked."),
+			gettext_noop("Valid values are combinations of rmgrIDs"),
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"NONE",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -3259,6 +3273,7 @@ static struct config_string ConfigureNamesString[] =
 		"stderr",
 		check_log_destination, assign_log_destination, NULL
 	},
+
 	{
 		{"log_directory", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination directory for log files."),
@@ -9903,6 +9918,128 @@ assign_log_destination(const char *newval, void *extra)
 	Log_destination = *((int *) extra);
 }
 
+static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool	*newwalconsistency;
+	int i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array */
+	for (i = 0; i < RM_MAX_ID + 1; i++)
+		newwalconsistency[i] = false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *tok = (char *) lfirst(l);
+		if (pg_strcasecmp(tok, "Heap2") == 0)
+		{
+			newwalconsistency[RM_HEAP2_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Heap") == 0)
+		{
+			newwalconsistency[RM_HEAP_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Btree") == 0)
+		{
+			newwalconsistency[RM_BTREE_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Hash") == 0)
+		{
+			newwalconsistency[RM_HASH_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Gin") == 0)
+		{
+			newwalconsistency[RM_GIN_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Gist") == 0)
+		{
+			newwalconsistency[RM_GIST_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Sequence") == 0)
+		{
+			newwalconsistency[RM_SEQ_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "SPGist") == 0)
+		{
+			newwalconsistency[RM_SPGIST_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "BRIN") == 0)
+		{
+			newwalconsistency[RM_BRIN_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "Generic") == 0)
+		{
+			newwalconsistency[RM_GENERIC_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "XLOG") == 0)
+		{
+			newwalconsistency[RM_XLOG_ID] = true;
+		}
+		else if (pg_strcasecmp(tok, "NONE") == 0)
+		{
+			for (i = 0; i < RM_MAX_ID + 1; i++)
+				newwalconsistency[i] = false;
+			break;
+		}
+		else if (pg_strcasecmp(tok, "ALL") == 0)
+		{
+			/*
+			 * The following rmids can have backup blocks.
+			 * We'll enable this feature only for these rmids.
+			 */
+			newwalconsistency[RM_HEAP2_ID] = true;
+			newwalconsistency[RM_HEAP_ID] = true;
+			newwalconsistency[RM_BTREE_ID] = true;
+			newwalconsistency[RM_HASH_ID] = true;
+			newwalconsistency[RM_GIN_ID] = true;
+			newwalconsistency[RM_GIST_ID] = true;
+			newwalconsistency[RM_SEQ_ID] = true;
+			newwalconsistency[RM_SPGIST_ID] = true;
+			newwalconsistency[RM_BRIN_ID] = true;
+			newwalconsistency[RM_GENERIC_ID] = true;
+			newwalconsistency[RM_XLOG_ID] = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
 static void
 assign_syslog_facility(int newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d0666c..e1f688e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,11 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = 'none'		# Valid values are combinations of
+					# Heap2, Heap, Btree, Hash, Gin, Gist, Sequence,
+					# SPGist, BRIN, Generic and XLOG. It can also
+					# be set to ALL to enable all the values.
+					# (change requires restart)
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index b53591d..e9c7914 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,check) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..8418281 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,check) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..d99dd42 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_checkConsistency(XLogReaderState *record);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index 63f2120..a8ecd35 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -40,5 +40,6 @@ extern void GenericXLogAbort(GenericXLogState *state);
 extern void generic_redo(XLogReaderState *record);
 extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
+extern void generic_checkConsistency(XLogReaderState *record);
 
 #endif   /* GENERIC_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..c5e80fd 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -80,4 +80,5 @@ extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
 
+extern void gin_checkConsistency(XLogReaderState *record);
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 1231585..3ad246b 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -464,6 +464,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_checkConsistency(XLogReaderState *record);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..28f8aca 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -21,5 +21,6 @@
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hash_checkConsistency(XLogReaderState *record);
 
 #endif   /* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..c52e27c 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -398,4 +398,5 @@ extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
 				 Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
 
+extern void heap_checkConsistency(XLogReaderState *record);
 #endif   /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..8e5f1fc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -776,4 +776,5 @@ extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
 
+extern void btree_checkConsistency(XLogReaderState *record);
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..3e6d014 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,checkConsistency) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..9ff80f3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, xlog_checkConsistency)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_checkConsistency)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_checkConsistency)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_checkConsistency)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_checkConsistency)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_checkConsistency)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_checkConsistency)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_checkConsistency)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_checkConsistency)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_checkConsistency)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_checkConsistency)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..edd224c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_checkConsistency(XLogReaderState *record);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..d19b9ec 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
@@ -274,6 +276,8 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern void xlog_checkConsistency(XLogReaderState *record);
+
 /*
  * Starting/stopping a base backup
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..e9d210f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -276,6 +276,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_checkConsistency) (XLogReaderState *record);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..287143b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,8 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		require_image; /* True if the image is actually needed for redo.
+					has_image is always true when the WAL consistency check is enabled. */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..34e28c0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,7 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
-
+#define BKPIMAGE_IS_REQUIRED		0x04	/* page is required by the WAL record */
 /*
  * Extra header information used when page image has "hole" and
  * is compressed.
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 6af60d8..26895fc 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -20,6 +20,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is a convenience that makes it possible to identify
+ * whether a page is being used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
@@ -81,5 +94,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_checkConsistency(XLogReaderState *record);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..b8d850a
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+/* Entry point for page masking */
+extern char *mask_page(RmgrIds rmid, uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_heap_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_btree_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_hash_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_spg_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gist_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_brin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..b754134 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -432,4 +432,5 @@ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
 #endif   /* BUFPAGE_H */
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index fd71095..3050dd8 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -8,6 +8,10 @@ use Test::More tests => 4;
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'my_backup';
 
@@ -18,6 +22,10 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_1->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 
 # Take backup of standby 1 (not mandatory, but useful to check if
@@ -28,6 +36,10 @@ $node_standby_1->backup($backup_name);
 my $node_standby_2 = get_new_node('standby_2');
 $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
+$node_standby_2->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create some content on master and check its presence in standby 1
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index fc2bf7e..ed9da1d 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -11,6 +11,10 @@ my $node_master = get_new_node('master');
 $node_master->init(
 	has_archiving    => 1,
 	allows_streaming => 1);
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 my $backup_name = 'my_backup';
 
 # Start it
@@ -27,6 +31,10 @@ $node_standby->append_conf(
 	'postgresql.conf', qq(
 wal_retrieve_retry_interval = '100ms'
 ));
+$node_standby->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby->start;
 
 # Create some content on master
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index a82545b..6452086 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -27,7 +27,10 @@ sub test_recovery_standby
 			qq($param_item
 ));
 	}
-
+	$node_standby->append_conf(
+		'postgresql.conf', qq(
+	wal_consistency = 'All'
+	));
 	$node_standby->start;
 
 	# Wait until standby has replayed enough data
@@ -48,7 +51,10 @@ sub test_recovery_standby
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(has_archiving => 1, allows_streaming => 1);
-
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 # Start it
 $node_master->start;
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 3ee8df2..42c4257 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -13,6 +13,10 @@ $ENV{PGDATABASE} = 'postgres';
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 
 # Take backup
@@ -23,10 +27,18 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_1->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 my $node_standby_2 = get_new_node('standby_2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_2->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create some content on master
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 640295b..b782cc2 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -9,6 +9,10 @@ use Test::More tests => 1;
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 
 # And some content
@@ -28,6 +32,10 @@ $node_standby->append_conf(
 	'recovery.conf', qq(
 recovery_min_apply_delay = '${delay}s'
 ));
+$node_standby->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby->start;
 
 # Make new content on master and check its presence in standby depending
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..63a10c4 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -13,6 +13,10 @@ $node_master->append_conf(
 max_replication_slots = 4
 wal_level = logical
 ));
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'master_backup';
 
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..5911d65 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -46,6 +46,10 @@ sub test_sync_state
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'master_backup';
 
@@ -56,18 +60,30 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_1->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 
 # Create standby2 linking to master
 my $node_standby_2 = get_new_node('standby2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_2->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create standby3 linking to master
 my $node_standby_3 = get_new_node('standby3');
 $node_standby_3->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_3->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_3->start;
 
 # Check that sync_state is determined correctly when
@@ -116,6 +132,10 @@ $node_standby_1->start;
 my $node_standby_4 = get_new_node('standby4');
 $node_standby_4->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
+$node_standby_4->append_conf(
+	'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_4->start;
 
 # Check that standby1 and standby2 whose names appear earlier in
#38Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#37)
Re: WAL consistency check facility

On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Hello,

As per the earlier discussions, I've attached the updated patch for
WAL consistency check feature. This is how the patch works:

The earlier patch (wal_consistency_v6.patch) was based on the commit
id 67e1e2aaff.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Kuntal Ghosh (#37)
Re: WAL consistency check facility

On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

I got two types of inconsistencies, as follows:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
redo, only BTP_DELETED flag is set in buffer page.

I see that inconsistency in the code as well. I think this is harmless,
because after the page is marked as deleted, it is not used for any
purpose other than to recycle it for re-use. After re-using it, the
caller is always supposed to initialize the flags based on its usage,
and I see that this is happening in the code, unless I am missing something.

I assume that we
should clear all btpo_flags before setting BTP_DELETED in
_bt_unlink_halfdead_page().

Yeah, we can do that for consistency. If we see any problem in doing
so, then I think we can log the flags and set them during replay.

Note - Please post your replies inline rather than top-posting them.
Top-posting breaks the flow of the discussion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#40Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#37)
Re: WAL consistency check facility

On Wed, Sep 7, 2016 at 7:22 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Hello,

Could you avoid top-posting please? More reference here:
http://www.idallen.com/topposting.html

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

- I've extended the RmgrTable with a new function pointer
rm_checkConsistency, which is called after rm_redo. (only when WAL
consistency check is enabled for this rmgrID)
- In each rm_checkConsistency, both backup pages and buffer pages are
masked accordingly before any comparison.

This leads to heavy code duplication...

- In postgresql.conf, a new guc variable named 'wal_consistency' is
added. Default value of this variable is 'None'. Valid values are
combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
values.

Lower-case is the usual policy for parameter values for GUC parameters.

- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

Results
------------------------

I've tested with installcheck and installcheck-world in master-standby
set-up. The following are the configuration parameters.

So you tested as well the recovery tests, right?

I got two types of inconsistencies, as follows:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
redo, only BTP_DELETED flag is set in buffer page. I assume that we
should clear all btpo_flags before setting BTP_DELETED in
_bt_unlink_halfdead_page().

The page is deleted, it does not matter, so you could just mask all
the flags for a deleted page...
[...]
+   /*
+    * Mask everything on a DELETED page.
+    */
+   if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
And that's what is happening.

2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
REVMAP page. This happens only for two cases. I'm not sure what the
reason can be.

Hm? This smells like a block reference bug. What are the cases you are
referring to?

I haven't done sufficient tests yet to measure the overhead of this
modification. I'll do that next.

I did a first pass on your patch, and I think that it could be
reduced considerably. There is a lot of code duplication; see below for
the details.

#include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
I know that I am a noisy one on the matter, but please double-check
for such useless noise in your patch. This is not the only instance.

+ newwalconsistency = (bool *) guc_malloc(ERROR,(RM_MAX_ID + 1)*sizeof(bool));
This spacing is not project-style. You may want to go through that:
https://www.postgresql.org/docs/devel/static/source.html

+$node_master->append_conf(
+   'postgresql.conf', qq(
+wal_consistency = 'All'
+));
Instead of duplicating that 7 times, you could just do it once in the
init() method of PostgresNode.pm. This really has meaning if enabled
by default.
+           /*
+            * The following rmids can have backup blocks.
+            * We'll enable this feature only for these rmids.
+            */
+           newwalconsistency[RM_HEAP2_ID] = true;
+           newwalconsistency[RM_HEAP_ID] = true;
+           newwalconsistency[RM_BTREE_ID] = true;
+           newwalconsistency[RM_HASH_ID] = true;
+           newwalconsistency[RM_GIN_ID] = true;
+           newwalconsistency[RM_GIST_ID] = true;
+           newwalconsistency[RM_SEQ_ID] = true;
+           newwalconsistency[RM_SPGIST_ID] = true;
+           newwalconsistency[RM_BRIN_ID] = true;
+           newwalconsistency[RM_GENERIC_ID] = true;
+           newwalconsistency[RM_XLOG_ID] = true;
Here you can just use MemSet with RM_MAX_ID, which simplifies maintenance of this code.
+           for(i = 0; i < RM_MAX_ID + 1 ; i++)
+               newwalconsistency[i] = false;
+           break;
Same here you can just use MemSet.
+       else if (pg_strcasecmp(tok, "NONE") == 0)
[...]
+       else if (pg_strcasecmp(tok, "ALL") == 0)
It seems to me that using NONE or ALL with any other keywords should
not be allowed.
+       if (pg_strcasecmp(tok, "Heap2") == 0)
+       {
+           newwalconsistency[RM_HEAP2_ID] = true;
+       }
Thinking more about it, I guess that we had better change the
definition list of rmgrs in rmgr.h and get something closer to
RmgrDescData that pg_xlogdump has to avoid all this stanza by
completing it with the name of the rmgr. The only special cases that
this code path would need to take care of would be then 'none' and
'all'. You could do this refactoring on top of the main patch to
simplify it as it is rather big (1.7k lines).
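
A minimal sketch of the table-driven parsing suggested above, assuming a hypothetical name table rather than the real rmgr list. Every identifier prefixed with `sketch_` is invented for illustration and is not part of the patch. The only special cases are 'all' and 'none', which are rejected when combined with any other keyword:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name table; the keywords are the ones listed in this thread. */
static const char *const sketch_rmgr_names[] = {
	"xlog", "heap2", "heap", "btree", "hash", "gin",
	"gist", "sequence", "spgist", "brin", "generic"
};

#define SKETCH_NUM_RMGRS \
	(sizeof(sketch_rmgr_names) / sizeof(sketch_rmgr_names[0]))

/* Case-insensitive comparison of tok[0..len) against a C string. */
bool
sketch_token_equals(const char *tok, size_t len, const char *name)
{
	size_t		i;

	if (strlen(name) != len)
		return false;
	for (i = 0; i < len; i++)
	{
		if (tolower((unsigned char) tok[i]) != tolower((unsigned char) name[i]))
			return false;
	}
	return true;
}

/*
 * Parse a comma-separated keyword list into enabled[], which must have
 * SKETCH_NUM_RMGRS entries.  Returns false for an empty or unknown keyword,
 * or when 'all' or 'none' is combined with anything else.  On failure the
 * contents of enabled[] are unspecified.
 */
bool
sketch_parse_wal_consistency(const char *value, bool *enabled)
{
	const char *p = value;
	size_t		ntokens = 0;
	bool		saw_special = false;
	size_t		i;

	assert(value != NULL && enabled != NULL);
	for (i = 0; i < SKETCH_NUM_RMGRS; i++)
		enabled[i] = false;

	for (;;)
	{
		const char *end;
		size_t		len;
		bool		matched = false;

		while (*p == ' ')
			p++;				/* skip leading spaces */
		end = p;
		while (*end != '\0' && *end != ',')
			end++;
		len = (size_t) (end - p);
		while (len > 0 && p[len - 1] == ' ')
			len--;				/* trim trailing spaces */
		if (len == 0)
			return false;		/* empty keyword */
		ntokens++;

		if (sketch_token_equals(p, len, "all"))
		{
			for (i = 0; i < SKETCH_NUM_RMGRS; i++)
				enabled[i] = true;
			saw_special = true;
		}
		else if (sketch_token_equals(p, len, "none"))
			saw_special = true;
		else
		{
			for (i = 0; i < SKETCH_NUM_RMGRS; i++)
			{
				if (sketch_token_equals(p, len, sketch_rmgr_names[i]))
				{
					enabled[i] = true;
					matched = true;
				}
			}
			if (!matched)
				return false;	/* unknown keyword */
		}

		if (*end == '\0')
			break;
		p = end + 1;
	}
	/* 'all' and 'none' must stand alone. */
	return !(saw_special && ntokens > 1);
}
```

With a table like this, adding a resource manager means adding one name to the table; the parser itself does not change.
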
+       if (inconsistent_loc < BLCKSZ)
+           elog(WARNING,
+                "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+                "forknum %u, blkno %u", inconsistent_loc,
+                rnode.spcNode, rnode.dbNode, rnode.relNode,
+                forknum, blkno);
+       else
+           elog(DEBUG1,
+                "Consistent page found, rel %u/%u/%u, "
+                "forknum %u, blkno %u",
+                rnode.spcNode, rnode.dbNode, rnode.relNode,
+                forknum, blkno);
This is going to be very chatty. Perhaps the elog level should be raised?
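
For reference, a minimal sketch of the comparison convention that the `inconsistent_loc < BLCKSZ` test above appears to rely on: return the offset of the first differing byte, or the page size when the pages match. The names `sketch_first_mismatch` and `SKETCH_BLCKSZ` are hypothetical, and the patch's own comparePages() may well differ:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192		/* stand-in for BLCKSZ */

/*
 * Compare two (already masked) pages byte by byte.  Returns the offset of
 * the first differing byte, or SKETCH_BLCKSZ when the pages are identical.
 */
size_t
sketch_first_mismatch(const unsigned char *a, const unsigned char *b)
{
	size_t		i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < SKETCH_BLCKSZ; i++)
	{
		if (a[i] != b[i])
			return i;
	}
	return SKETCH_BLCKSZ;
}
```

Under this convention only a return value below the page size needs to be reported, so the per-page "consistent" message could be dropped or demoted without losing information.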

-#define SEQ_MAGIC 0x1717
-
-typedef struct sequence_magic
-{
- uint32 magic;
-} sequence_magic;
You do not need this refactoring anymore.

+   void        (*rm_checkConsistency) (XLogReaderState *record);
All your _checkConsistency functions share the same pattern, in short
they all use a for loop for each block, call each time
XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
You would get a reduction by a couple of hundreds of lines by having a
smarter refactoring. And to be honest, if I look at your patch what I
think is the correct way of doing things is to add to the rmgr not
this check consistency function, but just a pointer to the masking
function.
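
A minimal sketch of that idea, assuming a hypothetical callback type and a tiny page size so that it stays self-contained. None of the `sketch_` names exist in PostgreSQL or in the patch. The loop, the copy, and the comparison live in one generic routine, and each rmgr only supplies a masking callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16		/* tiny stand-in for BLCKSZ */

/* Hypothetical per-rmgr callback: blank out bytes that may legitimately differ. */
typedef void (*sketch_mask_fn) (unsigned char *page);

/* Example mask: pretend byte 0 holds hint bits that redo need not reproduce. */
void
sketch_mask_hint_byte(unsigned char *page)
{
	page[0] = 0;
}

/*
 * Generic check shared by every rmgr: copy both pages, apply the rmgr's
 * mask (if any) to the copies, then compare.  Returns true when the masked
 * pages are identical.  The caller's pages are never modified.
 */
bool
sketch_pages_consistent(const unsigned char *backup_page,
						const unsigned char *replayed_page,
						sketch_mask_fn mask)
{
	unsigned char a[SKETCH_PAGE_SIZE];
	unsigned char b[SKETCH_PAGE_SIZE];

	assert(backup_page != NULL && replayed_page != NULL);
	memcpy(a, backup_page, SKETCH_PAGE_SIZE);
	memcpy(b, replayed_page, SKETCH_PAGE_SIZE);
	if (mask != NULL)
	{
		mask(a);
		mask(b);
	}
	return memcmp(a, b, SKETCH_PAGE_SIZE) == 0;
}
```

An rmgr with nothing to mask can simply register a NULL callback in this sketch, which keeps the per-rmgr code down to the masking logic itself.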
-- 
Michael


#41Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#40)
Re: WAL consistency check facility

Hello Michael,

Thanks for your detailed review.

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

It is possible to set wal_consistency to 'All' on the master and any other
value on the standby. But the scenario you mentioned will cause an error on
the standby, since it may not get the backup image required for the WAL
consistency check. I think that the user should be responsible for setting
this value correctly. We can improve the error message to make the
user aware of the situation.

- I've extended the RmgrTable with a new function pointer
rm_checkConsistency, which is called after rm_redo. (only when WAL
consistency check is enabled for this rmgrID)
- In each rm_checkConsistency, both backup pages and buffer pages are
masked accordingly before any comparison.

This leads to heavy code duplication...

+ void (*rm_checkConsistency) (XLogReaderState *record);
All your _checkConsistency functions share the same pattern, in short
they all use a for loop for each block, call each time
XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
You would get a reduction by a couple of hundreds of lines by having a
smarter refactoring. And to be honest, if I look at your patch what I
think is the correct way of doing things is to add to the rmgr not
this check consistency function, but just a pointer to the masking
function.

A pointer to the masking function will certainly remove a lot of redundant
code. I'll modify the patch accordingly.

- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.

Results
------------------------

I've tested with installcheck and installcheck-world in master-standby
set-up. The following are the configuration parameters.

So you tested as well the recovery tests, right?

Yes, I've run the recovery tests after enabling the TAP tests.

+           /*
+            * The following rmids can have backup blocks.
+            * We'll enable this feature only for these rmids.
+            */
+           newwalconsistency[RM_HEAP2_ID] = true;
+           newwalconsistency[RM_HEAP_ID] = true;
+           newwalconsistency[RM_BTREE_ID] = true;
+           newwalconsistency[RM_HASH_ID] = true;
+           newwalconsistency[RM_GIN_ID] = true;
+           newwalconsistency[RM_GIST_ID] = true;
+           newwalconsistency[RM_SEQ_ID] = true;
+           newwalconsistency[RM_SPGIST_ID] = true;
+           newwalconsistency[RM_BRIN_ID] = true;
+           newwalconsistency[RM_GENERIC_ID] = true;
+           newwalconsistency[RM_XLOG_ID] = true;
Here you can just use MemSet with RM_MAX_ID, which simplifies maintenance of this code.

Not all rmids can have backup blocks. So, for wal_consistency = 'all',
I've enabled only those rmids which can have backup blocks.

+       if (pg_strcasecmp(tok, "Heap2") == 0)
+       {
+           newwalconsistency[RM_HEAP2_ID] = true;
+       }
Thinking more about it, I guess that we had better change the
definition list of rmgrs in rmgr.h and get something closer to
RmgrDescData that pg_xlogdump has to avoid all this stanza by
completing it with the name of the rmgr. The only special cases that
this code path would need to take care of would be then 'none' and
'all'. You could do this refactoring on top of the main patch to
simplify it as it is rather big (1.7k lines).

I'm not sure about this point. wal_consistency doesn't support all
the rmids. We should have some way to check this.

I'll update the rest as you suggested.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#42Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#41)
Re: WAL consistency check facility

On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

It is possible to set wal_consistency to 'All' on the master and any other
value on the standby. But the scenario you mentioned will cause an error on
the standby, since it may not get the backup image required for the WAL
consistency check. I think that the user should be responsible for setting
this value correctly. We can improve the error message to make the
user aware of the situation.

Let's be careful here. You should as well consider things from the
angle that some parameter updates are WAL-logged as well, like
wal_level with the WAL record XLOG_PARAMETER_CHANGE.

- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.

I'd like to hear extra opinions about that, but IMO just having an
ERROR would be fine for the first implementation. Once you've bumped
into an ERROR, you are likely going to fix it first.

+           /*
+            * The following rmids can have backup blocks.
+            * We'll enable this feature only for these rmids.
+            */
+           newwalconsistency[RM_HEAP2_ID] = true;
+           newwalconsistency[RM_HEAP_ID] = true;
+           newwalconsistency[RM_BTREE_ID] = true;
+           newwalconsistency[RM_HASH_ID] = true;
+           newwalconsistency[RM_GIN_ID] = true;
+           newwalconsistency[RM_GIST_ID] = true;
+           newwalconsistency[RM_SEQ_ID] = true;
+           newwalconsistency[RM_SPGIST_ID] = true;
+           newwalconsistency[RM_BRIN_ID] = true;
+           newwalconsistency[RM_GENERIC_ID] = true;
+           newwalconsistency[RM_XLOG_ID] = true;
Here you can just use MemSet with RM_MAX_ID, which simplifies maintenance of this code.

Not all rmids can have backup blocks. So, for wal_consistency = 'all',
I've enabled only those rmids which can have backup blocks.

Even if some rmgrs do not support FPWs, I don't think that it is safe
to assume that the existing ones will never support them. Imagine for
example that feature X is implemented. Feature X adds rmgr Y, but rmgr
Y does not use FPWs. At a later point a new feature is added, which
makes rmgr Y use FPWs. Your patch would increase the number of places
to update, and with it the likelihood of introducing bugs. It would
be better to use a safe implementation from the maintenance point of
view, to be honest (the maintenance load of the masking functions is
somewhat eased by the fact that the on-disk format is kept compatible).

+       if (pg_strcasecmp(tok, "Heap2") == 0)
+       {
+           newwalconsistency[RM_HEAP2_ID] = true;
+       }
Thinking more about it, I guess that we had better change the
definition list of rmgrs in rmgr.h and get something closer to
RmgrDescData that pg_xlogdump has to avoid all this stanza by
completing it with the name of the rmgr. The only special cases that
this code path would need to take care of would be then 'none' and
'all'. You could do this refactoring on top of the main patch to
simplify it as it is rather big (1.7k lines).

I'm not sure about this point. wal_consistency doesn't support all
the rmids. We should have some way to check this.

I'd rather see this code done in such a way that all the rmgrs can be
handled; this approach is particularly attractive because
there is no need to change it if new rmgrs are added in the future.
(This is also a reason why I still think that a simple on/off
switch would be enough: users mostly have control of the SQL
that triggers WAL. And if you run tests, you'll likely think to
turn autovacuum off to keep it from generating FPWs and polluting the
logs, at least on the second run of your tests).

And if you move forward with the approach of making this parameter a
list, I think that it would be better to add a section in the WAL
documentation about resource managers, explaining what they are and
listing them. Then your parameter could link to
this part of the documentation and users would be able to see what kind of
values can be set. This removes the need to update multiple portions
of the docs if rmgrs are added or removed in the future, and it
minimizes the maintenance of this code.
--
Michael


#43Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#42)
Re: WAL consistency check facility

On Sat, Sep 10, 2016 at 3:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

It is possible to set wal_consistency to 'All' on the master and any other
value on the standby. But the scenario you mentioned will cause an error on
the standby, since it may not get the backup image required for the WAL
consistency check. I think that the user should be responsible for setting
this value correctly. We can improve the error message to make the
user aware of the situation.

Let's be careful here. You should as well consider things from the
angle that some parameter updates are WAL-logged as well, like
wal_level with the WAL record XLOG_PARAMETER_CHANGE.

It seems entirely unnecessary for the master and the standby to agree
here. I think what we need is two GUCs. One of them, which affects
only the master, controls whether the validation information is
included in the WAL, and the other, which affects only the standby,
affects whether validation is performed when the necessary information
is present. Or maybe skip the second one and just decree that
standbys will always validate if the necessary information is present.
Using the same GUC on both the master and the standby but making it
mean different things in each of those places (whether to log the
validation info in one case, whether to perform validation in the
other case) is another option that also avoids needing to enforce that
the setting is the same in both places, but probably an inferior one.
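
A minimal sketch of the two-switch arrangement described above, using invented names throughout (the struct, both functions, and both boolean settings are hypothetical and not GUCs from the patch). It also assumes that an image which redo restores anyway is not worth validating, since the comparison would be trivially equal:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what one block reference carries in a WAL record. */
typedef struct SketchBlockImage
{
	bool		has_image;			/* an image is attached to the record */
	bool		needed_for_redo;	/* redo must restore it (a regular FPW) */
} SketchBlockImage;

/*
 * Master side: decide what to attach when the record is inserted.  The
 * log_check_images switch only affects the node generating WAL.
 */
SketchBlockImage
sketch_insert_decision(bool needs_fpw, bool log_check_images)
{
	SketchBlockImage img;

	img.needed_for_redo = needs_fpw;
	img.has_image = needs_fpw || log_check_images;
	return img;
}

/*
 * Standby side: validate only when the local switch is on and the record
 * carries an image that redo did not already restore.  No agreement with
 * the master's setting is required.
 */
bool
sketch_should_validate(SketchBlockImage img, bool validate_on_replay)
{
	return validate_on_replay && img.has_image && !img.needed_for_redo;
}
```

Dropping the standby-side switch, as also suggested above, would amount to always passing true for `validate_on_replay` in this sketch.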

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#42)
Re: WAL consistency check facility

On Sat, Sep 10, 2016 at 12:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.

I'd like to hear extra opinions about that, but IMO just having an
ERROR would be fine for the first implementation. Once you've bumped
into an ERROR, you are likely going to fix it first.

+1 for just an ERROR to detect the inconsistency. I don't think adding an
additional GUC just to raise the error level is advisable.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#43)
Re: WAL consistency check facility

On Sat, Sep 10, 2016 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Sep 10, 2016 at 3:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

It is possible to set wal_consistency to 'all' on the master and any other
value on the standby. But the scenario you mentioned will cause an error on
the standby, since it may not get the backup image required for the WAL
consistency check. I think the user should be responsible for setting
this value correctly. We can improve the error message to make the
user aware of the situation.

Let's be careful here. You should as well consider things from the
angle that some parameter updates are WAL-logged as well, like
wal_level with the WAL record XLOG_PARAMETER_CHANGE.

It seems entirely unnecessary for the master and the standby to agree
here. I think what we need is two GUCs. One of them, which affects
only the master, controls whether the validation information is
included in the WAL, and the other, which affects only the standby,
affects whether validation is performed when the necessary information
is present.

From a clarity perspective this option sounds good, but I
am slightly afraid that it might be inconvenient for users to set
different values for these two parameters.

Or maybe skip the second one and just decree that
standbys will always validate if the necessary information is present.

Sounds like a better alternative and probably easier to configure for users.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#46Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Amit Kapila (#45)
1 attachment(s)
Re: WAL consistency check facility

Hello,

Based on the previous discussions, I've modified the existing patch.

+ void (*rm_checkConsistency) (XLogReaderState *record);
All your _checkConsistency functions share the same pattern, in short
they all use a for loop for each block, call each time
XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
You would save a couple of hundred lines with a
smarter refactoring. And to be honest, if I look at your patch what I
think is the correct way of doing things is to add to the rmgr not
this check consistency function, but just a pointer to the masking
function.

+1. In rmgrlist, I've added a pointer to the masking function for each rmid.
A common function named checkConsistency calls these masking functions
based on their rmid and does comparison for each block.
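
To illustrate the shape of this refactoring, here is a minimal standalone
sketch in plain C. It is not the patch code: the toy page layout (an 8-byte
LSN followed by a one-byte hint field), TOY_PAGE_SIZE, and every toy_* name
are assumptions made up for the example. Each resource manager entry carries
only a mask callback, and one shared routine masks both pages and compares
them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* toy page size; the real code uses BLCKSZ */
#define TOY_LSN_BYTES 8		/* assume the first 8 bytes hold the page LSN */

typedef void (*toy_mask_fn) (unsigned char *page);

/* Hypothetical mask: blank out the LSN, which legitimately differs. */
static void
toy_mask_lsn_only(unsigned char *page)
{
	memset(page, 0xFF, TOY_LSN_BYTES);
}

/* Hypothetical mask: the LSN plus a one-byte hint field at offset 8. */
static void
toy_mask_lsn_and_hint(unsigned char *page)
{
	memset(page, 0xFF, TOY_LSN_BYTES);
	page[TOY_LSN_BYTES] = 0xFF;
}

typedef struct
{
	const char *name;
	toy_mask_fn mask;		/* NULL: consistency check not supported */
} toy_rmgr;

static const toy_rmgr toy_rmgr_table[] = {
	{"XLOG", NULL},
	{"Heap", toy_mask_lsn_and_hint},
	{"Btree", toy_mask_lsn_only},
};

/*
 * Shared check: mask private copies of both pages with the rmgr's callback,
 * then compare. Returns the offset of the first differing byte,
 * TOY_PAGE_SIZE if the pages match, or -1 if the rmgr has no mask callback.
 */
static int
toy_check_consistency(int rmid, const unsigned char *replayed,
					  const unsigned char *image)
{
	unsigned char a[TOY_PAGE_SIZE];
	unsigned char b[TOY_PAGE_SIZE];
	toy_mask_fn mask = toy_rmgr_table[rmid].mask;
	int			i;

	if (mask == NULL)
		return -1;

	memcpy(a, replayed, TOY_PAGE_SIZE);
	memcpy(b, image, TOY_PAGE_SIZE);
	mask(a);
	mask(b);

	for (i = 0; i < TOY_PAGE_SIZE; i++)
	{
		if (a[i] != b[i])
			break;
	}
	return i;
}
```

Given two pages that differ only in the LSN bytes and in the hint byte, the
"Heap" entry reports them as consistent, while the "Btree" entry, which does
not mask the hint byte, reports the mismatch at offset 8.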

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

If wal_consistency is enabled for an rmid, the standby always checks whether
a backup image exists, i.e. whether BKPBLOCK_HAS_IMAGE is set.
(I guess Amit and Robert also suggested the same in the thread.)
Basically, BKPBLOCK_HAS_IMAGE is set if a block contains an image, and
BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
image is required during redo. When we decode a WAL record, the has_image
flag of DecodedBkpBlock is set only if BKPIMAGE_IS_REQUIRED_FOR_REDO is set.
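
The interplay of the two flags can be sketched as follows. This is a
simplified standalone model, not the patch code: the bit values, the
toy_block_header struct, and the toy_* function names are illustrative
assumptions. The insert side attaches an image when either a full-page write
is needed or the consistency check is enabled, and the replay side restores
the image only when it was marked as required for redo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; not taken from the patch's headers. */
#define TOY_HAS_IMAGE				0x10	/* block carries a page image */
#define TOY_IMAGE_REQUIRED_FOR_REDO	0x04	/* image must be restored at redo */

typedef struct
{
	uint8_t		fork_flags;
	uint8_t		bimg_info;
} toy_block_header;

/* Insert side: decide which flags a block reference gets. */
static toy_block_header
toy_assemble(bool needs_backup, bool consistency_enabled)
{
	toy_block_header hdr = {0, 0};

	/* An image is attached for a normal FPW or for the consistency check. */
	if (needs_backup || consistency_enabled)
		hdr.fork_flags |= TOY_HAS_IMAGE;

	/* Only a genuine FPW has to be restored during redo. */
	if (needs_backup)
		hdr.bimg_info |= TOY_IMAGE_REQUIRED_FOR_REDO;

	return hdr;
}

/* Replay side: is there an image available for comparison? */
static bool
toy_image_present(toy_block_header hdr)
{
	return (hdr.fork_flags & TOY_HAS_IMAGE) != 0;
}

/* Replay side: should redo overwrite the page with the image? */
static bool
toy_restore_during_redo(toy_block_header hdr)
{
	return toy_image_present(hdr) &&
		(hdr.bimg_info & TOY_IMAGE_REQUIRED_FOR_REDO) != 0;
}
```

In this model, a record written with only the consistency check enabled
carries an image that is available for comparison but is not restored during
redo.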

+       if (pg_strcasecmp(tok, "Heap2") == 0)
+       {
+           newwalconsistency[RM_HEAP2_ID] = true;
+       }
Thinking more about it, I guess that we had better change the
definition list of rmgrs in rmgr.h and get something closer to
RmgrDescData that pg_xlogdump has to avoid all this stanza by
completing it with the name of the rmgr. The only special cases that
this code path would need to take care of would be then 'none' and
'all'. You could do this refactoring on top of the main patch to
simplify it as it is rather big (1.7k lines).

I've modified it to work the way pg_xlogdump does. Additionally, it checks
whether a masking function is defined for the rmid. Hence, in the future,
if we want to include another rmid in the WAL consistency check, we just need
to define its masking function.
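
As a rough standalone illustration of this table-driven lookup (not the patch
code), the sketch below matches each list element against a table of resource
manager names and rejects names that have no masking function. The three-entry
table, the has_mask field, and the toy_* names are assumptions for the
example. How 'all' and 'none' combine with individual names is also an
assumption: here tokens are simply processed left to right.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUM_RMGRS 3

typedef struct
{
	const char *name;
	bool		has_mask;	/* stands in for a non-NULL masking function */
} toy_rmgr_desc;

static const toy_rmgr_desc toy_rmgrs[TOY_NUM_RMGRS] = {
	{"XLOG", false},
	{"Heap", true},
	{"Btree", true},
};

/* Case-insensitive comparison of a length-delimited token with a name. */
static bool
toy_name_eq(const char *tok, size_t len, const char *name)
{
	size_t		i;

	if (strlen(name) != len)
		return false;
	for (i = 0; i < len; i++)
	{
		if (tolower((unsigned char) tok[i]) != tolower((unsigned char) name[i]))
			return false;
	}
	return true;
}

/*
 * Parse a comma-separated list into enabled[]. Returns false for an unknown
 * name or for a resource manager without a masking function.
 */
static bool
toy_parse_consistency(const char *value, bool enabled[TOY_NUM_RMGRS])
{
	const char *p = value;
	int			i;

	memset(enabled, 0, TOY_NUM_RMGRS * sizeof(bool));

	while (*p != '\0')
	{
		const char *start;
		size_t		len;
		bool		matched = false;

		/* Skip separators, then measure the next token. */
		while (*p == ' ' || *p == ',')
			p++;
		if (*p == '\0')
			break;
		start = p;
		while (*p != '\0' && *p != ',' && *p != ' ')
			p++;
		len = (size_t) (p - start);

		if (toy_name_eq(start, len, "none"))
		{
			memset(enabled, 0, TOY_NUM_RMGRS * sizeof(bool));
			continue;
		}
		if (toy_name_eq(start, len, "all"))
		{
			for (i = 0; i < TOY_NUM_RMGRS; i++)
				enabled[i] = toy_rmgrs[i].has_mask;
			continue;
		}

		for (i = 0; i < TOY_NUM_RMGRS; i++)
		{
			if (toy_name_eq(start, len, toy_rmgrs[i].name))
			{
				if (!toy_rmgrs[i].has_mask)
					return false;	/* known rmgr, but nothing to mask with */
				enabled[i] = true;
				matched = true;
				break;
			}
		}
		if (!matched)
			return false;
	}
	return true;
}
```

Supporting a new resource manager in this model only requires flipping its
has_mask entry, which mirrors the point that only a masking function needs to
be defined.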

- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

For now, I've kept this as a WARNING message to detect all inconsistencies
at once. Once the patch is finalized, I'll change it to an ERROR.

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v7_base_commit_ID_40b449a.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..3ca64d1 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,12 +26,13 @@
 #include "commands/tablespace.h"
 #include "replication/message.h"
 #include "replication/origin.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
+	{ name, redo, desc, identify, startup, cleanup, maskPage },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..77e79f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -95,6 +96,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char		*wal_consistency_string = NULL;
+bool		*wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -870,6 +873,7 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void checkConsistency(RmgrId rmid, XLogReaderState *record);
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6944,6 +6948,14 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with the WAL record
+				 * are consistent with the existing pages. This check is done only
+				 * if consistency check is enabled for the corresponding rmid.
+				 */
+				if (wal_consistency[record->xl_rmid])
+					checkConsistency(record->xl_rmid, xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11708,3 +11720,80 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+static void
+checkConsistency(RmgrId rmid, XLogReaderState *record)
+{
+	uint8           info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	Page		new_page, old_page;
+	int		block_id;
+	int		inconsistent_loc;
+	bool		has_image;
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		old_page = (Page) palloc(BLCKSZ);
+
+		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		{
+			Buffer buf;
+			char *norm_new_page, *norm_old_page;
+
+			if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+			{
+				/* Caller specified a bogus block_id. Don't do anything. */
+				continue;
+			}
+			/*
+			 * Read the contents from the current buffer
+			 * and store it in a temporary page.
+			 */
+			buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+			if (!BufferIsValid(buf))
+				continue;
+			new_page = BufferGetPage(buf);
+
+			/*
+			 * Read the contents from the backup copy, stored in WAL record
+			 * and store it in a temporary page. Before restoring, set
+			 * has_image value as true, since RestoreBlockImage checks
+			 * this flag. After restoring the image, restore the value of
+			 * has_image flag.
+			 */
+			has_image = record->blocks[block_id].has_image;
+			record->blocks[block_id].has_image = true;
+			if (!RestoreBlockImage(record, block_id, old_page))
+				elog(ERROR, "failed to restore block image");
+			record->blocks[block_id].has_image = has_image;
+
+			/* Mask pages */
+			norm_new_page = RmgrTable[rmid].rm_maskPage(info, blkno, new_page);
+			norm_old_page = RmgrTable[rmid].rm_maskPage(info, blkno, old_page);
+
+			/* Time to compare the old and new contents */
+			inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+			if (inconsistent_loc < BLCKSZ)
+				elog(WARNING,
+					"Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+					"forknum %u, blkno %u", inconsistent_loc,
+					rnode.spcNode, rnode.dbNode, rnode.relNode,
+					forknum, blkno);
+			else
+				elog(DEBUG3,
+					"Consistent page found, rel %u/%u/%u, "
+					"forknum %u, blkno %u",
+					rnode.spcNode, rnode.dbNode, rnode.relNode,
+					forknum, blkno);
+
+			pfree(norm_new_page);
+			pfree(norm_old_page);
+			ReleaseBuffer(buf);
+		}
+		pfree(old_page);
+	}
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..54308dd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -513,6 +513,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +557,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for current rmid,
+		 * we do a fpw for the current block.
+		 */
+		include_image = needs_backup || wal_consistency[rmid];
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +625,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * Remember that, if WAL consistency check is enabled for the current rmid,
+			 * we always include backup image with the WAL record. But, during redo we
+			 * restore the backup block only if needs_backup is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +694,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..f41f92a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1026,6 +1026,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
+	bool checkConsistency = false;
+
+	#ifndef FRONTEND
+	/* Check whether WAL consistency check is enabled for the current rmid. */
+	checkConsistency = wal_consistency[record->xl_rmid];
+	#endif
 
 	ResetDecoder(state);
 
@@ -1114,11 +1120,29 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			}
 			datatotal += blk->data_len;
 
+			/*
+			 * cross check that has_image is set if wal consistency check
+			 * is enabled for current rmid.
+			 */
+			if (checkConsistency && !blk->has_image)
+			{
+				report_invalid_record(state,
+				 "WAL consistency check is enabled, but BKPBLOCK_HAS_IMAGE not set at %X/%X",
+									  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
+				goto err;
+			}
+
 			if (blk->has_image)
 			{
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+				/*
+				 * During redo, backup image is restored if has_image is set. Hence,
+				 * set has_image accordingly.
+				 */
+				blk->has_image = blk->bimg_info & BKPIMAGE_IS_REQUIRED_FOR_REDO;
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1242,7 +1266,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
-		if (blk->has_image)
+		/*
+		 * If wal consistency check is enabled for current rmid, then it will always
+		 * have a backup image.
+		 */
+		if (blk->has_image || checkConsistency)
 		{
 			blk->bkp_image = ptr;
 			ptr += blk->bimg_len;
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..d380032
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,498 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/brin_page.h"
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/hash.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufmask.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+static void mask_page_lsn(Page page);
+static void mask_page_hint_bits(Page page);
+static void mask_unused_space(Page page);
+static char *mask_common_page(uint8 info, BlockNumber blkno,
+					const char *page, bool maskHints,
+					bool maskUnusedSpace);
+
+/*
+ * Mask Page LSN
+ */
+static void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+	PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF);
+}
+
+/*
+ * Mask Page hint bits
+ */
+static void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set on the
+	 * master, so we must silence the failures that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a xlog page
+ */
+char *
+mask_xlog_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	return mask_common_page(info, blkno, page, false, false);
+}
+
+/*
+ * Mask a heap page
+ */
+char *
+mask_heap_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to current block number and offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+			{
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+			}
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a btree page
+ */
+char *
+mask_btree_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a hash page
+ */
+char *
+mask_hash_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm);
+	/*
+	 * Mask everything on an UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE)== 0)
+	{
+		/*
+		 * For pages other than metapage,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a Gin page
+ */
+char *
+mask_gin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	GinPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a GIST page
+ */
+char *
+mask_gist_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, 0xFFFFFFFFFFFFFFFF);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL record.
+	 * Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* GiST redo never marks a page as having garbage, so mask that flag. */
+	GistClearPageHasGarbage(page_norm);
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a Sequence page
+ */
+char *
+mask_seq_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	return mask_common_page(info, blkno, page, false, true);
+}
+
+/*
+ * Mask a SpGist page
+ */
+char *
+mask_spg_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a BRIN page
+ */
+char *
+mask_brin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* We need to handle brin pages of type Meta and Revmap if needed */
+
+	return (char *)page_norm;
+}
+
+/*
+ * Mask a generic page
+ */
+char *
+mask_generic_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	return mask_common_page(info, blkno, page, true, true);
+}
+
+/*
+ * Mask a common page
+ */
+static char *
+mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	if (maskHints)
+		mask_page_hint_bits(page_norm);
+
+	if (maskUnusedSpace)
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 1b70bfb..bf89dc6 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1141,3 +1141,47 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
 }
+
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
+ * it returns the location where the first mismatch has occurred.
+ */
+int
+comparePages(char *page1, char *page2)
+{
+	char	buf1[BLCKSZ * 2];
+	char	buf2[BLCKSZ * 2];
+	int		j = 0;
+	int		i;
+
+	/*
+	 * Convert the pages to be compared into hex format to facilitate
+	 * their comparison and make potential diffs more readable while
+	 * debugging.
+	 */
+	for (i = 0; i < BLCKSZ ; i++)
+	{
+		const char *digits = "0123456789ABCDEF";
+		uint8 byte1 = (uint8) page1[i];
+		uint8 byte2 = (uint8) page2[i];
+
+		buf1[j] = digits[byte1 >> 4];
+		buf2[j] = digits[byte2 >> 4];
+
+		if (buf1[j] != buf2[j])
+		{
+			break;
+		}
+		j++;
+
+		buf1[j] = digits[byte1 & 0x0F];
+		buf2[j] = digits[byte2 & 0x0F];
+		if (buf1[j] != buf2[j])
+		{
+			break;
+		}
+		j++;
+	}
+	return i;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..04d8b3d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -144,6 +146,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3248,6 +3253,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Sets the rmgrIDs for which WAL consistency should be checked."),
+			gettext_noop("Valid values are combinations of rmgrIDs"),
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"none",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9903,6 +9919,121 @@ assign_log_destination(const char *newval, void *extra)
 	Log_destination = *((int *) extra);
 }
 
+static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any individual rmid? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	bool		isNone = false;	/* Does this guc include 'none' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array*/
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual rmid */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if maskPage function
+				 * is defined for this rmid.
+				 */
+				if (RmgrTable[i].rm_maskPage != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+
+		if (found)
+			continue;
+
+		/* Definitely not an individual rmid. Check for 'none' and 'all'. */
+		if (pg_strcasecmp(tok, "none") == 0)
+		{
+			MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+			isNone = true;
+		}
+		else if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * Followings are the rmids which can have backup blocks.
+			 * We'll enable this feature only for these rmids.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_maskPage != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* guc should contain either 'all' or 'none' or combination of rmids. */
+	if ((isAll && isNone) || (isAll && isRmgrId) || (isNone && isRmgrId))
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
 static void
 assign_syslog_facility(int newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d0666c..9ccc9c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,11 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = 'none'		# Valid values are combinations of
+					# heap2, heap, btree, hash, gin, gist, sequence,
+					# spgist, brin, generic and xlog. It can also
+					# be set to all to enable all the values.
+					# (change requires restart)
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index b53591d..baeeecc 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..f962e79 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..0d2bc1a 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..9e693b4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, mask_xlog_page)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, mask_heap_page)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, mask_heap_page)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, mask_btree_page)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, mask_hash_page)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, mask_gin_page)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, mask_gist_page)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, mask_seq_page)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, mask_spg_page)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, mask_brin_page)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, mask_generic_page)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..47fb0d0 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -276,6 +276,8 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	char		*(*rm_maskPage) (uint8 info, BlockNumber blkno,
+							const char *page);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..d747ab1 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,6 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_IS_REQUIRED_FOR_REDO		0x04	/* page is required during redo */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..b7be1e8
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+extern char *mask_xlog_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_heap_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_btree_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_hash_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gist_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_seq_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_spg_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_brin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_generic_page(uint8 info, BlockNumber blkno, const char *page);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..8ca98e4 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -432,4 +432,6 @@ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
+
 #endif   /* BUFPAGE_H */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fede1e6..5ef703e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -404,6 +404,7 @@ sub init
 	print $conf "fsync = off\n";
 	print $conf "log_statement = all\n";
 	print $conf "port = $port\n";
+	print $conf "wal_consistency = all\n";
 
 	if ($params{allows_streaming})
 	{
#47Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#46)
Re: WAL consistency check facility

On Mon, Sep 12, 2016 at 5:06 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

+ void (*rm_checkConsistency) (XLogReaderState *record);
All your _checkConsistency functions share the same pattern, in short
they all use a for loop for each block, call each time
XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
You could cut a couple of hundred lines with a
smarter refactoring. And to be honest, if I look at your patch what I
think is the correct way of doing things is to add to the rmgr not
this check consistency function, but just a pointer to the masking
function.

+1. In rmgrlist, I've added a pointer to the masking function for each rmid.
A common function named checkConsistency calls these masking functions
based on their rmid and does comparison for each block.
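
The rmgr-table approach described above can be sketched in standalone C as follows. This is a simplified illustration, not the code from the patch: SketchRmgr, sketch_rmgr_table, mask_lsn_only and sketch_check_consistency are hypothetical names, the page is shrunk to 64 bytes, and the masking callback modifies a private copy in place instead of using the patch's rm_maskPage signature.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ 64        /* tiny page size, for illustration only */

/* Masking callback: blanks out bytes that may legitimately differ. */
typedef void (*mask_fn) (char *page);

typedef struct SketchRmgr
{
    const char *name;
    mask_fn     mask;           /* NULL if this rmgr has no pages to check */
} SketchRmgr;

/* Hypothetical masking routine: pretend bytes 0..7 hold the page LSN. */
static void
mask_lsn_only(char *page)
{
    memset(page, 0, 8);
}

/* Hypothetical two-entry table; the real rmgr list is much longer. */
static const SketchRmgr sketch_rmgr_table[] = {
    {"Transaction", NULL},
    {"Heap", mask_lsn_only},
};

/*
 * One shared check for every rmgr: copy both pages, mask the copies with
 * the rmgr's callback, then compare. Returns true when the pages agree,
 * or when the rmgr has no masking function and so nothing is checked.
 */
bool
sketch_check_consistency(int rmid, const char *replayed, const char *fpw)
{
    char        a[SKETCH_BLCKSZ];
    char        b[SKETCH_BLCKSZ];
    mask_fn     mask = sketch_rmgr_table[rmid].mask;

    if (mask == NULL)
        return true;

    memcpy(a, replayed, SKETCH_BLCKSZ);
    memcpy(b, fpw, SKETCH_BLCKSZ);
    mask(a);
    mask(b);
    return memcmp(a, b, SKETCH_BLCKSZ) == 0;
}
```

With this shape, supporting a new rmgr only means adding a masking callback to the table; the per-block loop and the comparison live in one place.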

The patch size is down from 79kB to 38kB. That gets better :)

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

If wal_consistency is enabled for a rmid, standby will always check whether
backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
(I guess Amit and Robert also suggested the same in the thread)
Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
image is required during redo. When we decode a wal record, has_image
flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.

Ah, I see. But if we actually store the status in the record itself,
then at replay we don't care about the value of wal_consistency.
That's the same concept used by wal_compression. So shouldn't
you have more specific checks related to that in checkConsistency? You
actually don't need to check for anything in xlogreader.c, just check
for the consistency if there is a need to do so, or do nothing.

For now, I've kept this as a WARNING message to detect all inconsistencies
at once. Once the patch is finalized, I'll change it to an ERROR message.

Or say FATAL. This way the server is taken down.

Thoughts?

A couple of extra thoughts:
1) The routines of each rmgr are located in a dedicated file, for
example GIN stuff is in ginxlog.c, etc. It seems to me that it would
be better to move each masking function where it belongs instead of
keeping them centralized. A couple of routines need to be centralized, so I'd
suggest putting them in a new file, like xlogmask.c (xlog because
this is now completely part of WAL replay), including the LSN, the hint
bit and the other common routines.

2) Regarding page comparison:
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
+ * it returns the location where the first mismatch has occurred.
+ */
+int
+comparePages(char *page1, char *page2)
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.

3)
+static void checkConsistency(RmgrId rmid, XLogReaderState *record);
The RMGR ID is part of the decoded record, so you could just remove
RmgrId from the list of arguments and simplify this interface.

4) If this patch still goes with the possibility to set up a list of
RMGRs, documentation is needed for that. I'd suggest writing first a
patch to explain what are RMGRs for WAL, then apply the WAL
consistency facility on top of it and link wal_consistency to it.

5)
+           has_image = record->blocks[block_id].has_image;
+           record->blocks[block_id].has_image = true;
+           if (!RestoreBlockImage(record, block_id, old_page))
+               elog(ERROR, "failed to restore block image");
+           record->blocks[block_id].has_image = has_image;
Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?

6)
+           /*
+            * Remember that, if WAL consistency check is enabled for the current rmid,
+            * we always include backup image with the WAL record. But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
+           if (needs_backup)
+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

7) This patch has zero documentation. Please add some. Anyone
on this list other than those who worked on the first versions
(Heikki, Simon and I?) is going to have a hard time reviewing this
patch in detail going forward if there is no reference explaining what this
feature does for the user...

This patch is heading in the right direction, but I think it is still far
from being ready for commit. So I am going to mark it as returned
with feedback if there are no objections.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#43)
Re: WAL consistency check facility

On Sun, Sep 11, 2016 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It seems entirely unnecessary for the master and the standby to agree
here. I think what we need is two GUCs. One of them, which affects
only the master, controls whether the validation information is
included in the WAL, and the other, which affects only the standby,
affects whether validation is performed when the necessary information
is present. Or maybe skip the second one and just decree that
standbys will always validate if the necessary information is present.
Using the same GUC on both the master and the standby but making it
mean different things in each of those places (whether to log the
validation info in one case, whether to perform validation in the
other case) is another option that also avoids needing to enforce that
the setting is the same in both places, but probably an inferior one.

Thinking more about that, there is no actual need to do anything
complicated here. We could just track at the record level whether a
consistency check needs to be done at replay and do it. If nothing
is set, just do nothing. That would allow us to promote this parameter
to SIGHUP. wal_compression does something similar.
--
Michael


#49Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#47)
Re: WAL consistency check facility

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

If wal_consistency is enabled for a rmid, standby will always check whether
backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
(I guess Amit and Robert also suggested the same in the thread)
Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
image is required during redo. When we decode a wal record, has_image
flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.

Ah, I see. But if we actually store the status in the record itself,
then at replay we don't care about the value of wal_consistency.
That's the same concept used by wal_compression. So shouldn't
you have more specific checks related to that in checkConsistency? You
actually don't need to check for anything in xlogreader.c, just check
for the consistency if there is a need to do so, or do nothing.

I'm sorry, but I don't quite follow you here. If a WAL record contains
an image, has_image should be set since it helps in decoding the
record. But during redo, if XLogRecHasBlockImage() returns true, i.e.,
has_image is set, the block is always restored. In our case,
a record can have a backup image that should not be restored. So we need
to decide two things:
1. Does a record contain a backup image? (required for decoding the record)
2. If it has an image, should we restore it during redo?
I think we should decide these in DecodeXLogRecord() only. BKPBLOCK_HAS_IMAGE
answers the first question whereas BKPIMAGE_IS_REQUIRED_FOR_REDO
answers the second one. In DecodeXLogRecord(), we check that
BKPBLOCK_HAS_IMAGE is set if wal_consistency is enabled for
this record. The has_image flag is set according to
BKPIMAGE_IS_REQUIRED_FOR_REDO, and is later used to decide whether we
want to restore a block or not.

For now, I've kept this as a WARNING message to detect all inconsistencies
at once. Once the patch is finalized, I'll change it to an ERROR message.

Or say FATAL. This way the server is taken down.

Thoughts?

+1. I'll do that.

A couple of extra thoughts:
1) The routines of each rmgr are located in a dedicated file, for
example GIN stuff is in ginxlog.c, etc. It seems to me that it would
be better to move each masking function where it belongs instead of
keeping them centralized. A couple of routines need to be centralized, so I'd
suggest putting them in a new file, like xlogmask.c (xlog because
this is now completely part of WAL replay), including the LSN, the hint
bit and the other common routines.

Sounds good. But I think that the file name for the common masking routines
should be bufmask.c, since we are only masking buffers.

2) Regarding page comparison:
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
+ * it returns the location where the first mismatch has occurred.
+ */
+int
+comparePages(char *page1, char *page2)
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.

memcmp(), right?

5)
+           has_image = record->blocks[block_id].has_image;
+           record->blocks[block_id].has_image = true;
+           if (!RestoreBlockImage(record, block_id, old_page))
+               elog(ERROR, "failed to restore block image");
+           record->blocks[block_id].has_image = has_image;
Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?

Sorry, I completely missed this.

6)
+           /*
+            * Remember that, if WAL consistency check is enabled for the current rmid,
+            * we always include backup image with the WAL record. But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
+           if (needs_backup)
+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

needs_backup is set when XLogRecordAssemble() decides that a backup image
should be included in the record for redo purposes. This image will be
restored during redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the
included image should be restored during redo (that is, whether has_image
should be set).

7) This patch has zero documentation. Please add some. Anyone
on this list other than those who worked on the first versions
(Heikki, Simon and I?) is going to have a hard time reviewing this
patch in detail going forward if there is no reference explaining what this
feature does for the user...

This patch is heading in the right direction, but I think it is still far
from being ready for commit. So I am going to mark it as returned
with feedback if there are no objections.

I think the only major change this patch needs is proper and detailed
documentation. Other than that, there are only very minor changes, which can
be done quickly. Right?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#50Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#49)
Re: WAL consistency check facility

On Tue, Sep 13, 2016 at 6:07 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

For now, I've kept this as a WARNING message to detect all inconsistencies
at once. Once the patch is finalized, I'll change it to an ERROR message.

Or say FATAL. This way the server is taken down.

What I'd really like to see here is a way to quickly identify in the
buildfarm the moment an inconsistent WAL has been introduced by a new
feature, some bug fix, or perhaps a deficiency in the masking
routines. We could definitely tune that later on, by controlling with
a GUC if this generates a WARNING instead of a FATAL, the former being
more useful for production environments, and the latter for tests. It
would also be good to think about a set of tests; one rough idea
would be to modify an on-disk page for a table, and work on that to
force an inconsistency to be triggered.

A couple of extra thoughts:
1) The routines of each rmgr are located in a dedicated file, for
example GIN stuff is in ginxlog.c, etc. It seems to me that it would
be better to move each masking function where it belongs instead of
keeping them centralized. A couple of routines need to be centralized, so I'd
suggest putting them in a new file, like xlogmask.c (xlog because
this is now completely part of WAL replay), including the LSN, the hint
bit and the other common routines.

Sounds good. But I think that the file name for the common masking routines
should be bufmask.c, since we are only masking buffers.

That makes sense as well. No objections to that.

2) Regarding page comparison:
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.

memcmp(), right?

Yep :)

6)
+           /*
+            * Remember that, if WAL consistency check is enabled for the current rmid,
+            * we always include backup image with the WAL record. But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
+           if (needs_backup)
+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

needs_backup is set when XLogRecordAssemble() decides that a backup image
should be included in the record for redo purposes. This image will be
restored during redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the
included image should be restored during redo (that is, whether has_image
should be set).

When decoding a record, I think that you had better not use has_image
to assume that a FPW has to be double-checked. This had better be a
different boolean flag, say check_page or similar. This way, after
decoding a record, it is possible to know if there is a FPW, and if a
check on it is needed or not.

7) This patch has zero documentation. Please add some. Anyone
on this list other than those who worked on the first versions
(Heikki, Simon and I?) is going to have a hard time reviewing this
patch in detail going forward if there is no reference explaining what this
feature does for the user...

This patch is heading in the right direction, but I think it is still far
from being ready for commit. So I am going to mark it as returned
with feedback if there are no objections.

I think the only major change this patch needs is proper and detailed
documentation. Other than that, there are only very minor changes, which can
be done quickly. Right?

It seems to me that you need to think about the way to document things
properly first, with for example:
- Have a first documentation patch that explains what a resource
manager for WAL is, and which types are available, with a nice table.
- Add documentation in your patch to explain the benefits of
using this facility. The main purpose is testing, but there is also
mention upthread of users who would like to get that into
production, assuming that the overhead is minimal.
- Add more comments in your code to finish. One example is
checkConsistency() that is here, but explains nothing.

Well, if you'd simply use an on/off switch to control the feature, the
documentation load for rmgrs would be zero, but as I am visibly
outnumbered in this fight... We could also have an off/on switch
implemented first, and extend that later on depending on the feedback
from other users. We discussed rmgr-level or relation-level tuning of
FPW compression at some point, but we ended up with the simplest
approach, and we have stuck with it.
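
For reference, the list form of the setting in the posted patch accepts either 'all', 'none', or a combination of rmgr names, and rejects any mix of those three kinds. A standalone sketch of that rule is below. It is an illustration under stated assumptions, not the patch's check hook: sketch_maskable is a hypothetical three-entry subset of rmgr names, and sketch_ieq and sketch_check_tokens are made-up helpers that work on an already-split token list.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of rmgr names that have a masking function. */
static const char *const sketch_maskable[] = {"heap", "btree", "gin"};
#define SKETCH_NMASKABLE (sizeof(sketch_maskable) / sizeof(sketch_maskable[0]))

/* Case-insensitive string equality. */
static bool
sketch_ieq(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == *b;            /* true only if both strings ended */
}

/*
 * Validate a token list and fill enabled[] (SKETCH_NMASKABLE entries).
 * Returns false for an unknown token or for a mix of 'all', 'none'
 * and individual rmgr names.
 */
bool
sketch_check_tokens(const char *const *toks, size_t ntoks, bool *enabled)
{
    bool        saw_all = false;
    bool        saw_none = false;
    bool        saw_rmgr = false;
    size_t      i;
    size_t      j;

    memset(enabled, 0, SKETCH_NMASKABLE * sizeof(bool));

    for (i = 0; i < ntoks; i++)
    {
        bool        found = false;

        for (j = 0; j < SKETCH_NMASKABLE; j++)
        {
            if (sketch_ieq(toks[i], sketch_maskable[j]))
            {
                enabled[j] = true;
                found = true;
                saw_rmgr = true;
                break;
            }
        }
        if (found)
            continue;

        if (sketch_ieq(toks[i], "none"))
            saw_none = true;
        else if (sketch_ieq(toks[i], "all"))
            saw_all = true;
        else
            return false;       /* unrecognized key word */
    }

    if ((saw_all && saw_none) || (saw_all && saw_rmgr) || (saw_none && saw_rmgr))
        return false;           /* invalid value combination */

    if (saw_all)
    {
        for (j = 0; j < SKETCH_NMASKABLE; j++)
            enabled[j] = true;
    }
    return true;
}
```

A plain on/off switch would reduce all of this to a single boolean, which is the trade-off being discussed here.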
--
Michael


#51Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#50)
Re: WAL consistency check facility

On Wed, Sep 14, 2016 at 6:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

2) Regarding page comparison:
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.

memcmp(), right?

Yep :)

If I use memcmp(), I won't get the byte location where the first mismatch
has occurred. It will be helpful to display the byte location which causes
an inconsistency.
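
A byte-offset comparison of this kind can be sketched as below. It is an illustration only, not the patch's comparePages(): first_mismatch and SKETCH_BLCKSZ are hypothetical names, with SKETCH_BLCKSZ standing in for BLCKSZ at an assumed 8192 bytes. It uses memcmp() as a fast path and scans byte by byte only when the pages differ.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size; stands in for BLCKSZ */

/*
 * Return the offset of the first differing byte, or SKETCH_BLCKSZ if the
 * two pages are identical.
 */
size_t
first_mismatch(const char *page1, const char *page2)
{
    size_t      i;

    /* Fast path: identical pages need no byte-by-byte scan. */
    if (memcmp(page1, page2, SKETCH_BLCKSZ) == 0)
        return SKETCH_BLCKSZ;

    /* The pages differ somewhere, so this loop always breaks early. */
    for (i = 0; i < SKETCH_BLCKSZ; i++)
    {
        if (page1[i] != page2[i])
            break;
    }
    return i;
}
```

This keeps the common (consistent) case as cheap as a plain memcmp() while still reporting the location of an inconsistency.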

6)
+           /*
+            * Remember that, if WAL consistency check is enabled for the current rmid,
+            * we always include backup image with the WAL record. But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
+           if (needs_backup)
+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

needs_backup is set when XLogRecordAssemble() decides that a backup image
should be included in the record for redo purposes. This image will be
restored during redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the
included image should be restored during redo (that is, whether has_image
should be set).

When decoding a record, I think that you had better not use has_image
to assume that a FPW has to be double-checked. This had better be a
different boolean flag, say check_page or similar. This way, after
decoding a record, it is possible to know if there is a FPW, and if a
check on it is needed or not.

I've made some modifications that remove the need to add
anything in DecodeXLogRecord().

Master
---------------
- If wal_consistency check is enabled or needs_backup is set in
XLogRecordAssemble(), we do a fpw.
- If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
which in turn sets the has_image flag while decoding the record.
- If a fpw needs to be restored during redo, i.e., needs_backup is true,
then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Standby
---------------
- In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
XLogRecHasBlockImageForRedo()(added by me*) return true, we restore the
backup image.
- In checkConsistency, we only check if XLogRecHasBlockImage() returns true
when wal_consistency check is enabled for this rmid.

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
BKPIMAGE_IS_REQUIRED_FOR_REDO.
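
The split described above (an image may be present for checking only, for redo, or for both) can be sketched as a small standalone program. This is only an illustration of the decision logic: the struct, the function names and the bit values are hypothetical and chosen for the example, not copied from xlogrecord.h or from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, used only inside this sketch. */
#define SK_BKPBLOCK_HAS_IMAGE				0x10
#define SK_BKPIMAGE_IS_REQUIRED_FOR_REDO	0x04

typedef struct SketchBlock
{
	uint8_t		fork_flags;		/* per-block header flags */
	uint8_t		bimg_info;		/* per-image header flags */
} SketchBlock;

/* Insert side: decide which flags a block reference carries. */
SketchBlock
sketch_assemble(bool needs_backup, bool consistency_enabled)
{
	SketchBlock blk = {0, 0};

	if (needs_backup || consistency_enabled)
	{
		blk.fork_flags |= SK_BKPBLOCK_HAS_IMAGE;
		if (needs_backup)
			blk.bimg_info |= SK_BKPIMAGE_IS_REQUIRED_FOR_REDO;
	}
	return blk;
}

/* Redo side: restore the image only if it was included for redo. */
bool
sketch_restore_on_redo(SketchBlock blk)
{
	/* An image flagged as required for redo must actually be present. */
	assert(!(blk.bimg_info & SK_BKPIMAGE_IS_REQUIRED_FOR_REDO) ||
		   (blk.fork_flags & SK_BKPBLOCK_HAS_IMAGE));

	return (blk.fork_flags & SK_BKPBLOCK_HAS_IMAGE) != 0 &&
		(blk.bimg_info & SK_BKPIMAGE_IS_REQUIRED_FOR_REDO) != 0;
}

/* Redo side: a consistency check is possible whenever an image exists. */
bool
sketch_can_check(SketchBlock blk)
{
	return (blk.fork_flags & SK_BKPBLOCK_HAS_IMAGE) != 0;
}
```

With consistency checking on and needs_backup false, the block carries an image that can be compared but is not restored, which is the case the new flag is meant to express.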

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#52Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#51)
Re: WAL consistency check facility

On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

Master
---------------
- If wal_consistency check is enabled or needs_backup is set in
XLogRecordAssemble(), we do a fpw.
- If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
which in turn sets the has_image flag while decoding the record.
- If a fpw needs to be restored during redo, i.e., needs_backup is true,
then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Here that should be if wal_consistency is true, no?

Standby
---------------
- In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
XLogRecHasBlockImageForRedo()(added by me*) return true, we restore the
backup image.
- In checkConsistency, we only check if XLogRecHasBlockImage() returns true
when wal_consistency check is enabled for this rmid.

My guess would have been that you do not need to check anymore for
wal_consistency in checkConsistency, making the GUC value only used on
master node.

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
BKPIMAGE_IS_REQUIRED_FOR_REDO.

Yes, that's more or less what you should have.
--
Michael

#53Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#52)
Re: WAL consistency check facility

On Wed, Sep 14, 2016 at 11:31 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

Master
---------------
- If wal_consistency check is enabled or needs_backup is set in
XLogRecordAssemble(), we do a fpw.
- If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
which in turn sets the has_image flag while decoding the record.
- If a fpw needs to be restored during redo, i.e., needs_backup is true,
then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Here that should be if wal_consistency is true, no?

Nope. I'll try to explain using some pseudo-code:
XLogRecordAssemble()
{
    ....
    include_image = needs_backup || wal_consistency[rmid];
    if (include_image)
    {
        ....
        set XLogRecordBlockHeader.fork_flags |= BKPBLOCK_HAS_IMAGE;
        if (needs_backup)
            set XLogRecordBlockImageHeader.bimg_info
                |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
        ....
    }
    .....
}

XLogReadBufferForRedoExtended()
{
    ......
    if (XLogRecHasBlockImage() && XLogRecHasBlockImageForRedo())
    {
        RestoreBlockImage();
        ....
        return BLK_RESTORED;
    }
    ......
}

checkConsistency()
{
    ....
    if (wal_consistency[rmid] && !XLogRecHasBlockImage())
        throw error;
    .....
}

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
BKPIMAGE_IS_REQUIRED_FOR_REDO.

For a backup image, any of the following is possible:
1. consistency should be checked.
2. page should be restored.
3. both 1 and 2.

The consistency check can be controlled by a GUC parameter. But the standby
must be told whether an image should be restored. For that, we
use the new flag.
Suggestions?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#54Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#53)
Re: WAL consistency check facility

On Thu, Sep 8, 2016 at 1:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
REVMAP page. This happens only for two cases. I'm not sure what the
reason can be.

Hm? This smells like a block reference bug. What are the cases you are
referring to?

Following is the only case where the backup page stored in the wal
record and the current page after redo are not consistent.

test:BRIN using gmake-check

Master
-----------------------------------------
STATEMENT: VACUUM brintest;
LOG: INSERT @ 0/59E1E0F8: - BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

Standby
----------------------------------------------
LOG: REDO @ 0/59E1B500; LSN 0/59E1E0F8: prev 0/59E17578; xid 0; len
14; blkref #0: rel 1663/16384/30556, blk 12; blkref #1: rel
1663/16384/30556, blk 1; blkref #2: rel 1663/16384/30556, blk 2 -
BRIN/UPDATE+INIT: heapBlk 100 pagesPerRange 1 old offnum 11, new
offnum 1

WARNING: Inconsistent page (at byte 26) found, rel 1663/16384/30556,
forknum 0, blkno 1
CONTEXT: xlog redo at 0/59E1B500 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

thoughts?
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#55Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#50)
Re: WAL consistency check facility

On Tue, Sep 13, 2016 at 9:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

It seems to me that you need to think about the way to document things
properly first, with for example:
- Have a first documentation patch that explains what a resource
manager for WAL is, and which types are available, with a nice table.
- Add documentation in your patch explaining the benefits of using
this facility. The main purpose is testing, but there is also
mention upthread of users who would like to run it in
production, assuming that the overhead is minimal.

So, I don't think that this patch should be required to document all
of the currently-undocumented stuff that somebody might want to know
that is related to this patch. It should be enough to document
the patch itself. One paragraph in config.sgml in the usual format
should be fine. Maybe two paragraphs. We do need to list the
resource managers, but that can just be something like this:

The default value for this setting is <literal>off</>. To check
all records written to the write-ahead log, set this parameter to
<literal>all</literal>. To check only some records, specify a
comma-separated list of resource managers. The resource managers
which are currently supported are <literal>heap</>, <literal>btree</>,
<literal>hash</>, BLAH, and BLAH.

If somebody wants to write some user-facing documentation of the
write-ahead log format, great. That could certainly be very helpful
for people who are running pg_xlogdump. But I don't think that stuff
goes in this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#56Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#55)
1 attachment(s)
Re: WAL consistency check facility

Hello,

I've attached the updated patch with the necessary documentation and comments.
I referred to Robert's reply in this thread and Simon's reply in the
"Production block comparison facility" thread when writing the documentation.

This feature is used to check the consistency of WAL records, i.e.,
whether the WAL records are inserted and applied correctly.
A GUC parameter named wal_consistency is added to enable this feature.
When wal_consistency is enabled for a WAL record, a full-page image is stored
along with the record. When such a full-page image arrives during redo, it is
compared against the current page to check whether both are consistent.
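
As a rough illustration of "mask, then compare": areas that are allowed to differ (for example the page LSN and the unused space between the lower and upper pointers) are overwritten with a marker byte in copies of both pages, and only then are the copies compared. The toy page layout below is an assumption made for this sketch and is much simpler than a real PostgreSQL page; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE	64		/* toy page size, not BLCKSZ */
#define SK_LSN_SIZE		8
#define SK_MASK_MARKER	0xFF

/* Toy page: an LSN, a hole described by lower/upper, and a data area. */
typedef struct SkPage
{
	uint8_t		lsn[SK_LSN_SIZE];
	uint16_t	lower;			/* start of unused space within data[] */
	uint16_t	upper;			/* end of unused space within data[] */
	uint8_t		data[SK_PAGE_SIZE - SK_LSN_SIZE - 2 * sizeof(uint16_t)];
} SkPage;

/* Return a copy of the page with the LSN and the unused space masked. */
SkPage
sk_mask(const SkPage *page)
{
	SkPage		copy;

	assert(page->lower <= page->upper && page->upper <= sizeof(page->data));

	copy = *page;
	memset(copy.lsn, SK_MASK_MARKER, sizeof(copy.lsn));
	memset(copy.data + copy.lower, SK_MASK_MARKER, copy.upper - copy.lower);
	return copy;
}

/* Two pages are consistent if their masked copies are byte-identical. */
bool
sk_pages_consistent(const SkPage *a, const SkPage *b)
{
	SkPage		ma = sk_mask(a);
	SkPage		mb = sk_mask(b);

	return memcmp(&ma, &mb, sizeof(SkPage)) == 0;
}
```

In this sketch a difference inside the masked regions is ignored, while any difference in the used part of data[] is reported as an inconsistency.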

The default value for this setting is none. To check all records written to the
write-ahead log, set this parameter to all. To check only some records, specify
a comma-separated list of resource managers. The resource managers which
are currently supported are xlog, heap2, heap, btree, hash, gin, gist, spgist,
sequence, brin and generic.
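
For example, a setting such as wal_consistency = 'heap,btree' would enable the check for two resource managers only. A minimal sketch of turning such a value into a per-resource-manager boolean array is shown below. It is not the parsing code from the patch: the function name, the array order and the handling of "none" and "all" are assumptions for illustration, and whitespace around names is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Names as listed above; the order is arbitrary in this sketch. */
const char *const sk_rmgr_names[] = {
	"xlog", "heap2", "heap", "btree", "hash", "gin",
	"gist", "spgist", "sequence", "brin", "generic"
};

#define SK_NUM_RMGRS (sizeof(sk_rmgr_names) / sizeof(sk_rmgr_names[0]))

/*
 * Fill enabled[] from a value like "none", "all" or "heap,btree".
 * Returns false if the list contains an unknown or empty name.
 */
bool
sk_parse_wal_consistency(const char *value, bool enabled[SK_NUM_RMGRS])
{
	const char *p = value;
	size_t		i;

	assert(value != NULL && enabled != NULL);

	for (i = 0; i < SK_NUM_RMGRS; i++)
		enabled[i] = (strcmp(value, "all") == 0);

	if (strcmp(value, "all") == 0 || strcmp(value, "none") == 0)
		return true;

	while (*p != '\0')
	{
		size_t		len = strcspn(p, ",");
		bool		found = false;

		for (i = 0; i < SK_NUM_RMGRS; i++)
		{
			if (strlen(sk_rmgr_names[i]) == len &&
				strncmp(p, sk_rmgr_names[i], len) == 0)
			{
				enabled[i] = true;
				found = true;
				break;
			}
		}
		if (!found)
			return false;

		p += len;
		if (*p == ',')
			p++;
	}
	return true;
}
```

The exact-length comparison keeps "heap" from matching "heap2".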

If any inconsistency is detected, a WARNING is raised. As per the discussions
in the earlier threads, this can be changed to ERROR/FATAL (just a one-word
change).
I've kept this as a WARNING because of some inconsistency in BRIN VACUUM
during gmake check.

In recovery tests, I've enabled this feature in PostgresNode.pm.

Thanks to Amit, Dilip, Michael, Simon and Robert for their valuable feedback.

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v8_base_commit_ID_c99dd5b.patch (text/x-patch; charset=US-ASCII)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..8b251b7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2470,6 +2470,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        The default value for this setting is <literal>none</>. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>xlog</>, <literal>heap2</>,
+        <literal>heap</>, <literal>btree</>, <literal>hash</>, <literal>gin</>,
+        <literal>gist</>, <literal>spgist</>, <literal>sequence</>, <literal>brin</>
+        and <literal>generic</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..f4feb88 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,45 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page
+ */
+char *
+mask_brin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* We need to handle brin pages of type Meta and Revmap if needed */
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..a247807 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,40 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a Gin page
+ */
+char *
+mask_gin_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	GinPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..778b7d7 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -443,3 +444,59 @@ gistXLogUpdate(Buffer buffer,
 
 	return recptr;
 }
+
+/*
+ * Mask a GIST page
+ */
+char *
+mask_gist_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber offnum,
+				maxoff;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/*Mask NSN*/
+	GistPageSetNSN(page_norm, 0xFFFFFFFFFFFFFFFF);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL record.
+	 * Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In Gist redo, we never mark a page as garbage. Hence, Mask It.*/
+	GistClearPageHasGarbage(page_norm);
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..952f7f6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
@@ -711,3 +712,60 @@ hash_redo(XLogReaderState *record)
 {
 	elog(PANIC, "hash_redo: unimplemented");
 }
+
+/*
+ * Mask a hash page
+ */
+char *
+mask_hash_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm);
+	/*
+	 * Mask everything on a UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE)== 0)
+	{
+		/*
+		 * For pages other than metapage,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..1e80ddc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,71 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page
+ */
+char *
+mask_heap_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to current block number and offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+			{
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+			}
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..cb6c96d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,64 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page
+ */
+char *
+mask_btree_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..aa4857b 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,28 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+char *
+mask_spg_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 1926d98..d6b543e 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -16,6 +16,7 @@
 #include "access/generic_xlog.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 /*-------------------------------------------------------------------------
@@ -533,3 +534,12 @@ generic_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffers[block_id]);
 	}
 }
+
+/*
+ * Mask a generic page
+ */
+char *
+mask_generic_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	return mask_common_page(info, blkno, page, true, true);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..3ca64d1 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,12 +26,13 @@
 #include "commands/tablespace.h"
 #include "replication/message.h"
 #include "replication/origin.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
+	{ name, redo, desc, identify, startup, cleanup, maskPage },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..f92d0ea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -52,6 +53,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/barrier.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -95,6 +97,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char		*wal_consistency_string = NULL;
+bool		*wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -870,6 +874,7 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void checkConsistency(XLogReaderState *record);
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6944,6 +6949,14 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with the WAL record
+				 * are consistent with the existing pages. This check is done only
+				 * if consistency check is enabled for the corresponding rmid.
+				 */
+				if (wal_consistency[record->xl_rmid])
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11708,3 +11721,109 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Mask a xlog page
+ */
+char *
+mask_xlog_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	/*
+	 * In xlog redo, we just restore the page from backup image.
+	 * Hence, we can mask it by using the common function.
+	 */
+	return mask_common_page(info, blkno, page, false, false);
+}
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	uint8           info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	Page		new_page, old_page;
+	int		block_id;
+	int		inconsistent_loc;
+
+	/* Consistency is checked only for records with backup blocks*/
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		old_page = (Page) palloc(BLCKSZ);
+
+		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		{
+			Buffer buf;
+			char *norm_new_page, *norm_old_page;
+
+			if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+			{
+				/* Caller specified a bogus block_id. Don't do anything. */
+				continue;
+			}
+			/*
+			 * Read the contents from the current buffer
+			 * and store it in a temporary page.
+			 */
+			buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+			if (!BufferIsValid(buf))
+				continue;
+			new_page = BufferGetPage(buf);
+
+			/*
+			 * cross check that has_image is set if wal consistency check
+			 * is enabled for current rmid.
+			 */
+			if (wal_consistency[rmid] && !XLogRecHasBlockImage(record, block_id))
+			{
+				elog(ERROR,
+				 "WAL consistency check is enabled, but BKPBLOCK_HAS_IMAGE not set at rel %u/%u/%u, "
+					"forknum %u, blkno %u", rnode.spcNode, rnode.dbNode, rnode.relNode,
+					forknum, blkno);
+				return;
+			}
+
+			/*
+			 * Read the contents from the backup copy, stored in WAL record
+			 * and store it in a temporary page.
+			 */
+			if (!RestoreBlockImage(record, block_id, old_page))
+				elog(ERROR, "failed to restore block image");
+
+			/* Mask pages */
+			norm_new_page = RmgrTable[rmid].rm_maskPage(info, blkno, new_page);
+			norm_old_page = RmgrTable[rmid].rm_maskPage(info, blkno, old_page);
+
+			/* Time to compare the old and new contents */
+			inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+			if (inconsistent_loc < BLCKSZ)
+				elog(WARNING,
+					"Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+					"forknum %u, blkno %u", inconsistent_loc,
+					rnode.spcNode, rnode.dbNode, rnode.relNode,
+					forknum, blkno);
+			else
+				elog(DEBUG3,
+					"Consistent page found, rel %u/%u/%u, "
+					"forknum %u, blkno %u",
+					rnode.spcNode, rnode.dbNode, rnode.relNode,
+					forknum, blkno);
+
+			pfree(norm_new_page);
+			pfree(norm_old_page);
+			ReleaseBuffer(buf);
+		}
+		pfree(old_page);
+	}
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..8f254b3 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -513,6 +513,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +557,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for current rmid,
+		 * we do a fpw for the current block.
+		 */
+		include_image = needs_backup || wal_consistency[rmid];
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +625,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * Remember that, if WAL consistency check is enabled for the current rmid,
+			 * we always include backup image with the WAL record. But, during redo we
+			 * restore the backup block only if needs_backup is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +694,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..e0f62ea 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -352,8 +352,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image for redo purpose, restore it. */
+	if (XLogRecHasBlockImageForRedo(record, block_id))
 	{
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..80c6fa8 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,12 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page
+ */
+char *
+mask_seq_page(uint8 info, BlockNumber blkno, const char *page)
+{
+	return mask_common_page(info, blkno, page, false, true);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..0270be6
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,105 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * Mask Page LSN
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+	PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set on the
+	 * master, so we must silence the failures this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a common page: always mask the LSN, and optionally mask the hint bits
+ * and the unused space between pd_lower and pd_upper, as requested by the
+ * maskHints and maskUnusedSpace arguments.
+ */
+char *
+mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace)
+{
+	Page	page_norm;
+
+	page_norm = (Page) palloc(BLCKSZ);
+	memcpy(page_norm, page, BLCKSZ);
+
+	/*
+	 * Mask the page LSN. The backup image is taken before the LSN is updated,
+	 * so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	if (maskHints)
+		mask_page_hint_bits(page_norm);
+
+	if (maskUnusedSpace)
+		mask_unused_space(page_norm);
+
+	return (char *)page_norm;
+}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..877b58d 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1177,3 +1177,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
 }
+
+/*
+ * Compare the contents of two pages.
+ * Returns BLCKSZ if the two pages are identical. Otherwise, returns the
+ * byte offset at which the first mismatch occurs.
+ */
+int
+comparePages(char *page1, char *page2)
+{
+	int		i;
+
+	for (i = 0; i < BLCKSZ ; i++)
+	{
+		uint8 byte1 = (uint8) page1[i];
+		uint8 byte2 = (uint8) page2[i];
+		if(byte1 != byte2)
+			break;
+	}
+	return i;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c72bd61..14128a5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -144,6 +146,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3248,6 +3253,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Sets the rmgrIDs for which WAL consistency should be checked."),
+			gettext_noop("Valid values are combinations of rmgrIDs"),
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"none",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9903,6 +9919,122 @@ assign_log_destination(const char *newval, void *extra)
 	Log_destination = *((int *) extra);
 }
 
+static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any individual rmid? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	bool		isNone = false;	/* Does this guc include 'none' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array*/
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual rmid */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if maskPage function
+				 * is defined for this rmid. We'll enable this feature
+				 * only for the rmids for which a masking function is defined.
+				 */
+				if (RmgrTable[i].rm_maskPage != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("WAL consistency check is not supported for \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+		/* If a valid rmid is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual rmid. Check for 'none' and 'all'. */
+		if (pg_strcasecmp(tok, "none") == 0)
+		{
+			MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+			isNone = true;
+		}
+		else if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * We'll enable this feature only for the rmids for which
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_maskPage != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* guc should contain either 'all' or 'none' or combination of rmids. */
+	if ((isAll && isNone) || (isAll && isRmgrId) || (isNone && isRmgrId))
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
 static void
 assign_syslog_facility(int newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b1c3aea..93041a1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,11 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = 'none'		# Valid values are combinations of
+					# heap2, heap, btree, hash, gin, gist, sequence,
+					# spgist, brin, generic and xlog. It can also
+					# be set to all to enable all the values.
+					# (change requires restart)
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index b53591d..baeeecc 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..f962e79 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..6c53b3f 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern char *mask_brin_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index 63f2120..16135e1 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -40,5 +40,6 @@ extern void GenericXLogAbort(GenericXLogState *state);
 extern void generic_redo(XLogReaderState *record);
 extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
+extern char *mask_generic_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* GENERIC_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..a359545 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern char *mask_gin_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 1231585..787a643 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -464,6 +464,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern char *mask_gist_page(uint8 info, BlockNumber blkno, const char *page);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..54780f2 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -21,5 +21,6 @@
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern char *mask_hash_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..afffc26 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern char *mask_heap_page(uint8 info, BlockNumber blkno, const char *page);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..abaf275 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern char *mask_btree_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..0d2bc1a 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,maskPage) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..9e693b4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, mask_xlog_page)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, mask_heap_page)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, mask_heap_page)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, mask_btree_page)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, mask_hash_page)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, mask_gin_page)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, mask_gist_page)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, mask_seq_page)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, mask_spg_page)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, mask_brin_page)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, mask_generic_page)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..6e52ea3 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern char *mask_spg_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..0238d21 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
@@ -226,6 +228,7 @@ extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
 extern void xlog_redo(XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
+extern char *mask_xlog_page(uint8 info, BlockNumber blkno, const char *page);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..47fb0d0 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -276,6 +276,8 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	char		*(*rm_maskPage) (uint8 info, BlockNumber blkno,
+							const char *page);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..5112e60 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,9 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecHasBlockImageForRedo(decoder, block_id) \
+	(XLogRecHasBlockImage(decoder, block_id) && \
+	(((decoder)->blocks[block_id].bimg_info & BKPIMAGE_IS_REQUIRED_FOR_REDO) > 0))
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..d747ab1 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,6 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_IS_REQUIRED_FOR_REDO		0x04	/* page is required during redo */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..a26102f 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern char *mask_seq_page(uint8 info, BlockNumber blkno, const char *page);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..0af9c35
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+extern char *mask_common_page(uint8 info, BlockNumber blkno,
+					const char *page, bool maskHints,
+					bool maskUnusedSpace);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..a5f34d3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -435,4 +435,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
+
 #endif   /* BUFPAGE_H */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fede1e6..5ef703e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -404,6 +404,7 @@ sub init
 	print $conf "fsync = off\n";
 	print $conf "log_statement = all\n";
 	print $conf "port = $port\n";
+	print $conf "wal_consistency = all\n";
 
 	if ($params{allows_streaming})
 	{
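
For readers skimming the patch above, the core check it performs at replay (mask the fields that legitimately differ, then compare byte by byte) can be sketched outside PostgreSQL roughly as follows. This is only an illustration under assumptions: the page layout, the SKETCH_* constants and the sketch_* function names are hypothetical stand-ins, not the patch's mask_common_page()/comparePages() and not PostgreSQL's real PageHeaderData.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ      8192   /* hypothetical stand-in for BLCKSZ */
#define SKETCH_MASK_MARKER 0xFF   /* same marker byte the patch defines */

/* Hypothetical, simplified header layout (NOT PostgreSQL's PageHeaderData). */
#define SKETCH_OFF_LSN     0      /* 8-byte LSN */
#define SKETCH_OFF_LOWER   8      /* 2-byte offset: start of free space */
#define SKETCH_OFF_UPPER   10     /* 2-byte offset: end of free space */
#define SKETCH_HDR_SIZE    12

/* Overwrite the parts of a page copy that may legitimately differ. */
void
sketch_mask_page(unsigned char *page)
{
	uint16_t	lower;
	uint16_t	upper;

	memcpy(&lower, page + SKETCH_OFF_LOWER, sizeof(lower));
	memcpy(&upper, page + SKETCH_OFF_UPPER, sizeof(upper));

	/* Sanity check, in the spirit of mask_unused_space() in the patch. */
	assert(lower >= SKETCH_HDR_SIZE && lower <= upper &&
		   upper <= SKETCH_BLCKSZ);

	/* The LSN always differs, so force it to a fixed value. */
	memset(page + SKETCH_OFF_LSN, SKETCH_MASK_MARKER, 8);

	/* The free space between lower and upper may hold leftover garbage. */
	memset(page + lower, SKETCH_MASK_MARKER, (size_t) (upper - lower));
}

/* Return the offset of the first differing byte, or SKETCH_BLCKSZ if none. */
size_t
sketch_compare_pages(const unsigned char *a, const unsigned char *b)
{
	size_t		i;

	for (i = 0; i < SKETCH_BLCKSZ; i++)
	{
		if (a[i] != b[i])
			break;
	}
	return i;
}

/* Mask private copies of both pages, then compare the masked copies. */
size_t
sketch_check_consistency(const unsigned char *replayed,
						 const unsigned char *image)
{
	static unsigned char a[SKETCH_BLCKSZ];
	static unsigned char b[SKETCH_BLCKSZ];

	memcpy(a, replayed, SKETCH_BLCKSZ);
	memcpy(b, image, SKETCH_BLCKSZ);
	sketch_mask_page(a);
	sketch_mask_page(b);
	return sketch_compare_pages(a, b);
}
```

A return value equal to SKETCH_BLCKSZ means the masked pages match; anything smaller is the offset of the first mismatch, following the return convention of comparePages() in the patch. Working on copies keeps the original buffers untouched, as mask_common_page() does with its palloc'd copy.
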
#57Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#56)
Re: WAL consistency check facility

On Thu, Sep 15, 2016 at 7:30 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

Thoughts?

There are still a couple of things in this patch that make me unhappy,
particularly the handling of the GUC with the xlogreader flags. I am
not sure if I'll be able to look at that again within the next couple
of weeks, but please be sure that this is registered in the next
commit fest. You could for example do that by changing the patch from
"Returned with Feedback" to "Moved to next CF" in the commit fest app.
Be sure as well to spend a couple of cycles in reviewing patches.
Usually for one patch sent, that's one patch of equal difficulty to
review, and there are many patches still waiting for feedback.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#57)
Re: WAL consistency check facility

On Thu, Sep 15, 2016 at 9:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 15, 2016 at 7:30 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

Thoughts?

There are still a couple of things in this patch that make me unhappy,
particularly the handling of the GUC with the xlogreader flags. I am
not sure if I'll be able to look at that again within the next couple
of weeks, but please be sure that this is registered in the next
commit fest. You could for example do that by changing the patch from
"Returned with Feedback" to "Moved to next CF" in the commit fest app.
Be sure as well to spend a couple of cycles in reviewing patches.
Usually for one patch sent, that's one patch of equal difficulty to
review, and there are many patches still waiting for feedback.

I don't think you have the right to tell Kuntal that he has to move
the patch to the next CommitFest because there are unspecified things
about the current version you don't like. If you don't have time to
review further, that's your call, but he can leave the patch as Needs
Review and see if someone else has time.

You are right that he should review some other people's patches, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#59Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#58)
Re: WAL consistency check facility

On Fri, Sep 16, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you have the right to tell Kuntal that he has to move
the patch to the next CommitFest because there are unspecified things
about the current version you don't like. If you don't have time to
review further, that's your call, but he can leave the patch as Needs
Review and see if someone else has time.

No complaints from here if done this way. I don't mean any offense :)
--
Michael


#60Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#59)
Re: WAL consistency check facility

On Fri, Sep 16, 2016 at 10:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 16, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you have the right to tell Kuntal that he has to move
the patch to the next CommitFest because there are unspecified things
about the current version you don't like. If you don't have time to
review further, that's your call, but he can leave the patch as Needs
Review and see if someone else has time.

No complaints from here if done this way. I don't mean any offense :)

Seeing nothing happening, I have moved the patch to next CF as there
is a new version, but no reviews for it.
--
Michael


#61Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#60)
Re: WAL consistency check facility

On Thu, Sep 29, 2016 at 12:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Seeing nothing happening, I have moved the patch to next CF as there
is a new version, but no reviews for it.

Just a note for anybody potentially looking at this patch. I am
currently looking at it in depth, and will post a new version of the
patch in a couple of days with review comments. Thanks.
--
Michael


#62Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#61)
1 attachment(s)
Re: WAL consistency check facility

On Thu, Oct 27, 2016 at 5:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 29, 2016 at 12:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Seeing nothing happening, I have moved the patch to next CF as there
is a new version, but no reviews for it.

Just a note for anybody potentially looking at this patch. I am
currently looking at it in depth, and will post a new version of the
patch in a couple of days with review comments. Thanks.

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:
- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSET and using a boolean (I
have been thinking hard about the list approach, and still I don't see
the point). It is then easy, for example, to leave it at false by
default and switch it to true to check whether a certain code path is
correctly exercised for WAL consistency. Note that this simplification
reduces the patch size by 100~150 lines. I know, I know, I expect some
complaints about that....
- Looking at wal_consistency at replay has no real value. What if the
parameter value on a standby differs from the one on the master? This
logic adds a whole new level of complication and potential bugs. So
instead my suggestion is to add a marker at WAL record level telling
whether this record should be checked for consistency at replay or
not. This is also quite flexible if you think about it: the standby
is made independent of how the WAL was generated on the master and
just applies it, checking whatever is marked for checking. The best
match here is to add a flag in XLogRecord->xl_info, making use of one
of the low 4 bits, of which only one is used now, for
XLR_SPECIAL_REL_UPDATE. An interesting side effect of this approach is
that callers of XLogInsert can set XLR_CHECK_CONSISTENCY to enforce a
consistency check even if wal_consistency is off. It is true that we
could register such data via XLogRegisterBuffer() instead, though
the 4 bits with the BKPBLOCK_* flags are already occupied so that
would induce a record length penalty, and I have a hard time believing
that one would like to check the consistency of a record in
particular.
- Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the
block definition is sort of weird because we want to know if
consistency should be checked at a higher level.
- In maskPage, the new rmgr routine, there is no need for the info and
blkno arguments. info is not used at all to begin with. blkno is used
for GIN pages to detect meta pages, but this can be guessed using the
opaque pointer. For heap pages and speculative inserts, masking the
blkno would be fine, so keeping the argument is not worth it.
- Instead of palloc'ing the old and new pages to compare, it would be
more performant to keep around two static buffers worth of BLCKSZ and
just use that. This way there is no need as well to perform any palloc
calls in the masking functions, limiting the risk of errors (those
code paths had better avoid errors IMO). It would be also less costly
to just pass to the masking function a pointer to a buffer of size
BLCKSZ and just do the masking on it.
- The masking routine names can be more generic, like XXX_mask(char
*page). No need to say page, we already know they work on it via the
argument provided.
- mask_xlog_page and mask_generic_page are useless as the block
restored comes directly from a FPW, so you are comparing basically a
FPW with itself.
- In checkConsistency, there is no need to allocate the old page. As
RestoreBlockImage stores the data in an already allocated buffer, you
can reuse the same location as the buffer masked afterwards.
- Removed comparePages(), using memcmp instead for simplicity. This
does not show the exact location of the inconsistency, but showing it
would not be a win as there could be more than one inconsistency
across a page. So this gives an invitation to the user to look at the
exact context. memcmp can be used anyway to understand where the
inconsistency is if need be.
- I have noticed that mask_common_page is meaningful just for the
sequence RMGR, and that alone does not justify its existence, so I
ripped it out.
- PostgresNode.pm enables wal_consistency. Seeing the high amount of
WAL this produces, I am lukewarm about doing that, though the patch
does include it btw...
- Standbys now stop with FATAL when an inconsistency is found. This
makes error detection easier on buildfarm machines.
- A couple of masking functions still use 0xFFFFFF or similar marks.
Those should be replaced by MASK_MARKER. Not done yet.
- Some of the masking routines should be refined, particularly the
heap and GIN functions. I have not spent time on that yet.
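
As a rough illustration of the record-level marker discussed in the
list above, here is a standalone sketch (not code from the patch). The
two mask values, XLR_SPECIAL_REL_UPDATE and XLOG_SWITCH mirror the
PostgreSQL headers of that era as far as I know. The 0x02 value for
XLR_CHECK_CONSISTENCY is an assumption, since that header hunk is not
quoted here, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLR_INFO_MASK           0x0F    /* low 4 bits: owned by xlog core */
#define XLR_RMGR_INFO_MASK      0xF0    /* high 4 bits: owned by the rmgr */
#define XLR_SPECIAL_REL_UPDATE  0x01
/* ASSUMPTION: bit picked for the new flag, not confirmed by the text above */
#define XLR_CHECK_CONSISTENCY   0x02

/* An rmgr-level opcode, used here only as an example value. */
#define XLOG_SWITCH             0x40

/* Hypothetical helper: may a caller of XLogInsert() pass these bits? */
static bool
info_is_valid_for_insert(uint8_t info)
{
    uint8_t allowed = XLR_RMGR_INFO_MASK |
                      XLR_SPECIAL_REL_UPDATE |
                      XLR_CHECK_CONSISTENCY;

    return (info & (uint8_t) ~allowed) == 0;
}

/* Hypothetical helper: set the marker when the GUC is on at insert time. */
static uint8_t
stamp_xl_info(uint8_t info, bool wal_consistency)
{
    assert(info_is_valid_for_insert(info));
    if (wal_consistency)
        info |= XLR_CHECK_CONSISTENCY;
    return info;
}

/* Strip the core-owned bits to recover the rmgr opcode. */
static uint8_t
rmgr_opcode(uint8_t xl_info)
{
    return xl_info & (uint8_t) ~XLR_INFO_MASK;
}

/* At replay only the record is consulted, never the local GUC. */
static bool
record_wants_check(uint8_t xl_info)
{
    return (xl_info & XLR_CHECK_CONSISTENCY) != 0;
}
```

Once the flag lives in the low 4 bits, a plain comparison such as
xl_info == XLOG_SWITCH stops matching for flagged records, which is
presumably why the patch masks xl_info with ~XLR_INFO_MASK before
comparing opcodes.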

On top of that, I have done a fair amount of testing, creating
manually some inconsistencies in the REDO routines to trigger failures
on standbys. And that was sort of fun to break things intentionally.
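
A test of that kind can be imitated outside the server with a toy
page. The sketch below is only an illustration under stated
assumptions: the page layout (8-byte LSN, then 2-byte lower and upper
offsets), the 8192-byte size, the 0xFF marker value and all toy_*
names are made up for the example. Only the idea of masking both
copies in static buffers and comparing them with memcmp comes from
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ     8192
#define MASK_MARKER    0xFF   /* ASSUMPTION: actual value not shown above */
#define TOY_OFF_LSN    0      /* hypothetical layout: 8-byte LSN */
#define TOY_OFF_LOWER  8      /* 2-byte start of free space */
#define TOY_OFF_UPPER  10     /* 2-byte end of free space */
#define TOY_HDRSZ      12

/* Two static work buffers, so the comparison never allocates. */
static char work_a[TOY_BLCKSZ];
static char work_b[TOY_BLCKSZ];

static uint16_t
toy_get_u16(const char *page, size_t off)
{
    uint16_t v;

    memcpy(&v, page + off, sizeof(v));
    return v;
}

/* Overwrite the parts that may legitimately differ after replay. */
static void
toy_mask(char *page)
{
    uint16_t lower = toy_get_u16(page, TOY_OFF_LOWER);
    uint16_t upper = toy_get_u16(page, TOY_OFF_UPPER);

    assert(TOY_HDRSZ <= lower && lower <= upper && upper <= TOY_BLCKSZ);
    memset(page + TOY_OFF_LSN, MASK_MARKER, 8);                  /* LSN */
    memset(page + lower, MASK_MARKER, (size_t) (upper - lower)); /* hole */
}

/* Returns -1 if the masked pages match, else the first differing offset. */
static long
toy_first_mismatch(const char *replayed, const char *image)
{
    memcpy(work_a, replayed, TOY_BLCKSZ);
    memcpy(work_b, image, TOY_BLCKSZ);
    toy_mask(work_a);
    toy_mask(work_b);

    if (memcmp(work_a, work_b, TOY_BLCKSZ) == 0)
        return -1;
    for (long i = 0; i < TOY_BLCKSZ; i++)
        if (work_a[i] != work_b[i])
            return i;
    return -1;                  /* not reached */
}

static void
toy_init(char *page, uint64_t lsn, uint16_t lower, uint16_t upper)
{
    memset(page, 0, TOY_BLCKSZ);
    memcpy(page + TOY_OFF_LSN, &lsn, sizeof(lsn));
    memcpy(page + TOY_OFF_LOWER, &lower, sizeof(lower));
    memcpy(page + TOY_OFF_UPPER, &upper, sizeof(upper));
}

/* Flip one bit at inject_at (< 0 for none) to simulate a broken redo. */
static long
toy_demo(long inject_at)
{
    static char image[TOY_BLCKSZ];
    static char replayed[TOY_BLCKSZ];

    assert(inject_at < TOY_BLCKSZ);
    toy_init(image, 100, 64, 4096);
    toy_init(replayed, 200, 64, 4096);  /* LSN moved on after replay */
    replayed[1000] = 0x55;              /* garbage inside the hole */
    if (inject_at >= 0)
        replayed[inject_at] ^= 0x01;
    return toy_first_mismatch(replayed, image);
}
```

A flip inside the LSN or the hole goes unnoticed, while a flip in the
line pointer area or in tuple data is reported. The byte scan after
memcmp is one simple way to locate the first difference when the
exact position is wanted.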

Another fun thing is the large amount of WAL that this generates (!),
so anyone willing to enable that in production would be crazy.
Enabling it for development and/or at session level is something that
would clearly help.
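
To get a feeling for the order of magnitude, here is a deliberately
crude estimate. It is a sketch under assumptions, not a measurement:
it ignores hole removal, wal_compression, per-image headers and the
full-page writes that happen anyway after a checkpoint. The function
name and the figures used with it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Rough WAL volume for nrecords records of record_bytes each.  When
 * force_images is true, every record additionally carries one full
 * block image per registered block, as wal_consistency would require.
 */
static uint64_t
wal_bytes_estimate(uint64_t nrecords, uint64_t record_bytes,
                   uint64_t blocks_per_record, uint64_t block_size,
                   bool force_images)
{
    uint64_t per_record = record_bytes;

    assert(block_size > 0);
    if (force_images)
        per_record += blocks_per_record * block_size;
    return nrecords * per_record;
}
```

With hypothetical 100-byte records that each touch one 8192-byte
block, the estimate grows from 100 to 8292 bytes per record, roughly
83 times more WAL.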

I am sending back the patch as waiting on author. Attached is what I
have up to now.
--
Michael

Attachments:

walconsistency_v9.patchtext/x-diff; charset=US-ASCII; name=walconsistency_v9.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..f9cfe16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly.  When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image of each page modified along with the WAL
+        record, inducing an increase of WAL generation.  Then, when a
+        full-page image arrives during redo, it compares against the current
+        page to check whether both are consistent or not.  The default value
+        is <literal>off</>.  Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..a19cfbb 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..90b6386 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,28 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+void
+gin_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..d0573f0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,55 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a GiST page before running consistency checks on it.
+ */
+void
+gist_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	/* XXX: Rework that */
+	GistPageSetNSN(page_norm, 0xFFFFFFFFFFFFFFFF);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL record.
+	 * Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..aa4705a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
@@ -711,3 +712,53 @@ hash_redo(XLogReaderState *record)
 {
 	elog(PANIC, "hash_redo: unimplemented");
 }
+
+/*
+ * Mask a hash page before performing consistency checks on it.
+ */
+void
+hash_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on a UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE) == 0)
+	{
+		/*
+		 * For pages other than metapage,
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..52b157a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,65 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	/*
+	 * Mask the page LSN.  The page image is stored before the LSN is
+	 * updated, so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to block number 0 and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, 0, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..8dc6234 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,59 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	/*
+	 * Mask the page LSN.  The page image is stored before the LSN is
+	 * updated, so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be refined.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..4ec8688 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page)
+{
+	Page	page_norm = (Page) page;
+
+	/*
+	 * Mask the page LSN.  The page image is stored before the LSN is
+	 * updated, so the LSNs of the two pages always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9bb46b..0ce4e5c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,7 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+bool		wal_consistency = false;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -867,6 +868,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -903,8 +905,9 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
+	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   rechdr->xl_info == XLOG_SWITCH);
+							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 
@@ -1261,6 +1264,91 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking functions defined, this is possible in the case
+	 * resource managers generating just full page writes, comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (!RmgrTable[rmid].rm_mask)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+		char	norm_new_page[BLCKSZ];
+		char	norm_old_page[BLCKSZ];
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, norm_old_page))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(norm_new_page, new_page, BLCKSZ);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(norm_new_page);
+		RmgrTable[rmid].rm_mask(norm_old_page);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(norm_new_page, norm_old_page, BLCKSZ) != 0)
+			elog(FATAL,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6948,6 +7036,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7785,6 +7882,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report)
 {
 	XLogRecord *record;
+	uint8		info;
 
 	if (!XRecOffIsValid(RecPtr))
 	{
@@ -7810,6 +7908,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	}
 
 	record = ReadRecord(xlogreader, RecPtr, LOG, true);
+	info = (record != NULL) ? (record->xl_info & ~XLR_INFO_MASK) : 0;
 
 	if (record == NULL)
 	{
@@ -7852,8 +7951,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		}
 		return NULL;
 	}
-	if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-		record->xl_info != XLOG_CHECKPOINT_ONLINE)
+	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+		info != XLOG_CHECKPOINT_ONLINE)
 	{
 		switch (whichChkpt)
 		{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..db5f37f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -513,6 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +559,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal_consistency is enabled, log a
+		 * full-page write for the current block.
+		 */
+		include_image = needs_backup || wal_consistency;
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -680,7 +689,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
@@ -756,6 +765,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
+	/*
+	 * Enforce consistency checks for this record if user is looking for
+	 * it.
+	 */
+	if (wal_consistency)
+		rechdr->xl_info |= XLR_CHECK_CONSISTENCY;
+
 	return &hdr_rdt;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..56d4c66 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -462,7 +462,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	if (record->xl_rmid == RM_XLOG_ID &&
+		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->EndRecPtr += XLogSegSize - 1;
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..1eaf79f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..f30c477
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,78 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * Mask Page LSN
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..d0416ae 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -1028,6 +1030,16 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets consistency of WAL records with existing pages at replay."),
+			NULL
+		},
+		&wal_consistency,
+		false,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"log_checkpoints", PGC_SIGHUP, LOGGING_WHAT,
 			gettext_noop("Logs each checkpoint."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..4a98d87 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,7 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = off			# enables consistency checks at WAL replay
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..f962e79 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..4ad9ab3 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..6360ba1 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..ccf22a6 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..3259c71 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -21,5 +21,6 @@
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hash_mask(char *page);
 
 #endif   /* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..a519dc5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..53f23d3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..5509cab 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..822b094 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..f8192d2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,7 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool wal_consistency;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..1202fbd 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..74d5aa0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary; note that if wal_consistency
+ * is enabled this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..f555899 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..874c25f
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..a5f34d3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -435,4 +435,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
+
 #endif   /* BUFPAGE_H */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index c1b16ca..d986273 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -412,6 +412,7 @@ sub init
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
 	print $conf "port = $port\n";
+	print $conf "wal_consistency = on\n";
 
 	if ($params{allows_streaming})
 	{
#63Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#62)
Re: WAL consistency check facility

On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:
- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I
have been thinking hard about that, and still I don't see the point).
It is rather easy to for example default it to false, and enable it to
true to check if a certain code path is correctly exercised or not for
WAL consistency. Note that this simplification reduces the patch size
by 100~150 lines. I know, I know, I'd expect some complaints about
that....

I don't understand how you can fail to see the point of that. As you
yourself said, this facility generates a ton of WAL. If you're
focusing on one AM, why would you want to be forced to incur the
overhead for every other AM? A good deal has been written about this
upthread already, and just saying "I don't see the point" seems to be
ignoring the explanations already given.

- Looking for wal_consistency at replay has no real value. What if the
parameter value on a standby differs from the one on the
master? This logic adds a whole new level of complications and
potential bugs. So instead my suggestion is to add a marker at WAL
record level to check if this record should be checked for consistency
at replay or not.

Agreed.

This is also quite flexible if you think about it:
the standby is made independent of the WAL generated on the master and
just applies records, checking those that are marked for it.

+1.

The best
match here is to add a flag in XLogRecord->xl_info, making use of
one of the low 4 bits; only one of them is used now, for
XLR_SPECIAL_REL_UPDATE.

Seems reasonable.
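
To illustrate the record-level marker discussed here, a minimal sketch follows. The flag values are the ones shown in the xlogrecord.h hunk of the patch; the three helper functions are hypothetical names introduced only for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as they appear in the xlogrecord.h hunk of the patch. */
#define XLR_INFO_MASK           0x0F
#define XLR_RMGR_INFO_MASK      0xF0
#define XLR_SPECIAL_REL_UPDATE  0x01
#define XLR_CHECK_CONSISTENCY   0x02

/* Hypothetical helper: set the consistency marker when building a record. */
static inline uint8_t
mark_for_consistency_check(uint8_t xl_info)
{
	/* The marker must live in the low 4 bits, not in the rmgr-owned bits. */
	assert((XLR_CHECK_CONSISTENCY & XLR_RMGR_INFO_MASK) == 0);
	return (uint8_t) (xl_info | XLR_CHECK_CONSISTENCY);
}

/* Hypothetical helper: the test a replaying server would apply per record. */
static inline bool
record_wants_consistency_check(uint8_t xl_info)
{
	return (xl_info & XLR_CHECK_CONSISTENCY) != 0;
}

/* Hypothetical helper: extract the high 4 bits reserved for the rmgr. */
static inline uint8_t
rmgr_info_bits(uint8_t xl_info)
{
	return (uint8_t) (xl_info & XLR_RMGR_INFO_MASK);
}
```

Because the decision travels with each record, the replay side only inspects xl_info and does not need to consult its own configuration.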

- in maskPage, the new rmgr routine, there is no need for the info and
blkno arguments. info is not used at all to begin with. blkno is used
for gin pages to detect meta pages but this can be guessed using the
opaque pointer. For heap pages and speculative inserts, masking the
blkno would be fine. That's not worth it.

Passing the blkno doesn't cost anything. If it avoids guessing,
that's entirely worth it.

- Instead of palloc'ing the old and new pages to compare, it would be
more performant to keep around two static buffers worth of BLCKSZ and
just use that. This way there is no need as well to perform any palloc
calls in the masking functions, limiting the risk of errors (those
code paths had better avoid errors IMO). It would be also less costly
to just pass to the masking function a pointer to a buffer of size
BLCKSZ and just do the masking on it.

We always palloc buffers like this so that they will be aligned. But
we could arrange not to repeat the palloc every time (see, e.g.,
BootstrapXLOG()).
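
A minimal sketch of the "allocate once, reuse for every comparison" idea, under stated assumptions: malloc stands in for palloc, BLCKSZ is taken as the default 8192, and the names masked_old_page, masked_new_page, init_masking_buffers and pages_match are hypothetical. The mask callback has the same shape as the rm_mask pointer in the RmgrData hunk.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192				/* assumed default block size */

/* Scratch buffers, allocated on first use and then kept around. */
static char *masked_old_page = NULL;
static char *masked_new_page = NULL;

/* Allocate the scratch buffers once; returns 0 on success, -1 on failure. */
static inline int
init_masking_buffers(void)
{
	if (masked_old_page == NULL)
		masked_old_page = malloc(BLCKSZ);	/* stand-in for palloc */
	if (masked_new_page == NULL)
		masked_new_page = malloc(BLCKSZ);
	return (masked_old_page != NULL && masked_new_page != NULL) ? 0 : -1;
}

/*
 * Copy both pages into the scratch buffers, apply the optional masking
 * callback to each copy, and compare.  Returns 1 if the masked copies are
 * identical, 0 if they differ, -1 if the buffers could not be allocated.
 */
static inline int
pages_match(const char *old_page, const char *new_page,
			void (*mask) (char *page))
{
	assert(old_page != NULL && new_page != NULL);
	if (init_masking_buffers() != 0)
		return -1;
	memcpy(masked_old_page, old_page, BLCKSZ);
	memcpy(masked_new_page, new_page, BLCKSZ);
	if (mask != NULL)
	{
		mask(masked_old_page);
		mask(masked_new_page);
	}
	return memcmp(masked_old_page, masked_new_page, BLCKSZ) == 0;
}
```

The masking functions then only write into a caller-provided buffer, so they perform no allocation of their own.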

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#62)
Re: WAL consistency check facility

On Fri, Oct 28, 2016 at 11:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:

Thanks for reviewing the patch in detail.

- Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the
block definition is sort of weird because we want to know if
consistency should be checked at a higher level.

A full page image can be included in the WAL record for the following
reasons:
1. It needs to be restored during replay.
2. WAL consistency should be checked for the record.
3. Both of the above.
In your patch, you've included a full page image whenever wal_consistency
is true. So, XLogReadBufferForRedoExtended always restores the image
and returns BLK_RESTORED, which is unacceptable. We can't change
the default WAL replay behaviour. A full image should only be restored if it is
necessary to do so. That said, I agree that BKPIMAGE_IS_REQUIRED_FOR_REDO
doesn't look like a clean way to implement this feature.
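
The three cases above can be sketched as two independent decisions per block. This is only an illustration: the BlockImageInfo struct, its field names and both functions are hypothetical and do not come from either version of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block description; field names are illustrative only. */
typedef struct BlockImageInfo
{
	bool		has_image;			/* record carries a full page image */
	bool		needed_for_redo;	/* image must be restored at replay */
	bool		check_consistency;	/* image is there for comparison */
} BlockImageInfo;

/* Restore the image only when replay actually needs it (cases 1 and 3). */
static inline bool
should_restore_image(const BlockImageInfo *blk)
{
	assert(blk != NULL);
	return blk->has_image && blk->needed_for_redo;
}

/* Compare the image with the replayed page in cases 2 and 3. */
static inline bool
should_compare_image(const BlockImageInfo *blk)
{
	assert(blk != NULL);
	return blk->has_image && blk->check_consistency;
}
```

Keeping the two questions separate is what lets default replay behaviour stay unchanged when an image is present only for checking.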

- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I
have been thinking hard about that, and still I don't see the point).
It is rather easy to for example default it to false, and enable it to
true to check if a certain code path is correctly exercised or not for
WAL consistency. Note that this simplification reduces the patch size
by 100~150 lines. I know, I know, I'd expect some complaints about
that....

As Robert also said, if I'm focusing on a single AM, I really don't
want to store full images and perform consistency checks for other AMs.

On top of that, I have done a fair amount of testing, creating
manually some inconsistencies in the REDO routines to trigger failures
on standbys. And that was sort of fun to break things intentionally.

I know the feeling. :)

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#65Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#63)
Re: WAL consistency check facility

On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:
- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I
have been thinking hard about that, and still I don't see the point).
It is rather easy to for example default it to false, and enable it to
true to check if a certain code path is correctly exercised or not for
WAL consistency. Note that this simplification reduces the patch size
by 100~150 lines. I know, I know, I'd expect some complaints about
that....

I don't understand how you can fail to see the point of that. As you
yourself said, this facility generates a ton of WAL. If you're
focusing on one AM, why would you want to be forced to incur the
overhead for every other AM? A good deal has been written about this
upthread already, and just saying "I don't see the point" seems to be
ignoring the explanations already given.

Hehe, I was expecting you to jump on those lines. While looking at the
patch I have simplified it first to focus on the core engine of the
thing. Adding back this code sounds fine to me as there is a wall of
contestation. I offer to do it myself if the effort is the problem.
--
Michael

#66Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#63)
Re: WAL consistency check facility

On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier

- Instead of palloc'ing the old and new pages to compare, it would be
more performant to keep around two static buffers worth of BLCKSZ and
just use that. This way there is no need as well to perform any palloc
calls in the masking functions, limiting the risk of errors (those
code paths had better avoid errors IMO). It would be also less costly
to just pass to the masking function a pointer to a buffer of size
BLCKSZ and just do the masking on it.

We always palloc buffers like this so that they will be aligned. But
we could arrange not to repeat the palloc every time (see, e.g.,
BootstrapXLOG()).

Yeah, we could go with that and there is clearly no reason to not do so.
--
Michael

#67Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#65)
Re: WAL consistency check facility

On Mon, Oct 31, 2016 at 5:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Hehe, I was expecting you to jump on those lines. While looking at the
patch I have simplified it first to focus on the core engine of the
thing. Adding back this code sounds fine to me as there is a wall of
contestation. I offer to do it myself if the effort is the problem.

IMHO, your rewrite of this patch was a bit heavy-handed. I haven't
scrutinized the code here so maybe it was a big improvement, and if so
fine, but if not it's better to collaborate with the author than to
take over. In any case, yeah, I think you should put that back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#68Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#63)
Re: WAL consistency check facility

On Mon, Oct 31, 2016 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:
- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I
have been thinking hard about that, and still I don't see the point).
It is rather easy to for example default it to false, and enable it to
true to check if a certain code path is correctly exercised or not for
WAL consistency. Note that this simplification reduces the patch size
by 100~150 lines. I know, I know, I'd expect some complaints about
that....

I don't understand how you can fail to see the point of that. As you
yourself said, this facility generates a ton of WAL. If you're
focusing on one AM, why would you want to be forced to incur the
overhead for every other AM? A good deal has been written about this
upthread already, and just saying "I don't see the point" seems to be
ignoring the explanations already given.

+1. I strongly agree.

--
Peter Geoghegan

#69Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#67)
1 attachment(s)
Re: WAL consistency check facility

On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

IMHO, your rewrite of this patch was a bit heavy-handed.

OK... Sorry for that.

I haven't
scrutinized the code here so maybe it was a big improvement, and if so
fine, but if not it's better to collaborate with the author than to
take over.

The review ended up turning into a large rewrite, and that was easier
to follow than a review listing all the small tweaks and things I went
through while reading the code. I have also experimented with a couple
of ideas in the patch that I added, so in the end it should be a gain
for everybody. I think that the last patch is an improvement; if you
want to form your own opinion on the matter, looking at the differences
between both patches would be the most direct way to go.

In any case, yeah, I think you should put that back.

Here you go, with this parameter back and the allocation of the masked
buffers done beforehand, close to the moment the XLogReader is
allocated. I have also removed wal_consistency from
PostgresNode.pm: small buildfarm machines would really suffer from it,
and hamster is very good at tracking race conditions when running TAP
tests. On top of that I have replaced a bunch of 0xFFFFF thingies by
their PG_UINT_MAX equivalents to keep things cleaner.

Now, I have put back the GUC-related code in exactly the same shape as
it was originally. Here are a couple of comments regarding it after
review:
- Let's drop 'none' as a magic keyword. Users are going to use an
empty string, and the default should be defined as such IMO.
- Using an allocated array of booleans to store the values of each
RMGR could be replaced by an integer using bitwise shifts. Your
option looks better and makes the code cleaner.
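
For comparison, the integer-with-bitwise-shifts alternative mentioned in the second point could look like the sketch below. It is shown only to make the trade-off concrete, since the message itself appears to prefer the existing approach. The DEMO_RM_* constants are hypothetical and simply mirror the positions of Heap, Btree and Gin in the rmgrlist.h hunk; the two functions are likewise invented for this illustration, and a 32-bit mask only works while there are at most 32 resource managers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ids mirroring list positions in the rmgrlist.h hunk. */
enum
{
	DEMO_RM_HEAP = 10,
	DEMO_RM_BTREE = 11,
	DEMO_RM_GIN = 13
};

/* Return the mask with the bit for the given resource manager set. */
static inline uint32_t
enable_rmgr(uint32_t mask, unsigned rmid)
{
	assert(rmid < 32);			/* a 32-bit mask holds at most 32 ids */
	return mask | (UINT32_C(1) << rmid);
}

/* Test whether consistency checking is enabled for a resource manager. */
static inline bool
rmgr_enabled(uint32_t mask, unsigned rmid)
{
	assert(rmid < 32);
	return (mask & (UINT32_C(1) << rmid)) != 0;
}
```

An array of booleans indexed by resource manager id avoids the 32-entry ceiling and the shift arithmetic, at the cost of an allocation.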

A more nitpicky remark: code comments don't refer much to RMIDs, but
they use the term "resource managers" more generally. I'd suggest
doing the same.
--
Michael

Attachments:

walconsistency_v10.patchapplication/x-download; name=walconsistency_v10.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..d7079f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,38 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, a
+        full-page image is stored along with the record. During redo, that
+        image is compared against the current page to check whether both
+        are consistent.
+       </para>
+
+       <para>
+        The default value for this setting is <literal>none</>. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>hash</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..3e4281e 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,28 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..96c30c0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a GiST page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..fd2ff15 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
@@ -711,3 +712,52 @@ hash_redo(XLogReaderState *record)
 {
 	elog(PANIC, "hash_redo: unimplemented");
 }
+
+/*
+ * Mask a hash page before performing consistency checks on it.
+ */
+void
+hash_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	HashPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on an UNUSED page.
+	 */
+	if (opaque->hasho_flag & LH_UNUSED_PAGE)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else if ((opaque->hasho_flag & LH_META_PAGE) == 0)
+	{
+		/*
+		 * For pages other than metapage, mask some line pointer bits,
+		 * particularly those marked as used on a master and unused on
+		 * a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..72a43a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,65 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	/*
+	 * Mask the page LSN. The backup image is stored before the LSN is
+	 * updated, so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..400df0d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,58 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	/*
+	 * Mask the page LSN. The backup image is stored before the LSN is
+	 * updated, so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (maskopaq->btpo_flags & BTP_DELETED)
+	{
+		/* Page content after the standard page header, opaque struct included */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..47c3467 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask an SP-GiST page before performing consistency checks on it.
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	/*
+	 * Mask the page LSN. The backup image is stored before the LSN is
+	 * updated, so the LSNs of the two pages will always differ.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6b1f24e..ccaa390 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Aligned buffers of size BLCKSZ, dedicated to consistency checks */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -903,8 +910,9 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
+	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   rechdr->xl_info == XLOG_SWITCH);
+							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 
@@ -1261,6 +1269,89 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and the backup page stored in the
+ * WAL record are consistent. Before the two pages are compared, a mask is
+ * applied to both so that areas such as hint bits and the unused space
+ * between pd_lower and pd_upper, among other things, are ignored. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* Records with no block references need no consistency check */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This is the case for
+	 * resource managers that generate only full-page writes; comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6148,6 +6239,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks; these had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6948,6 +7046,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7478,6 +7585,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -7785,6 +7898,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report)
 {
 	XLogRecord *record;
+	uint8		info;
 
 	if (!XRecOffIsValid(RecPtr))
 	{
@@ -7810,6 +7924,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	}
 
 	record = ReadRecord(xlogreader, RecPtr, LOG, true);
+	info = (record != NULL) ? (record->xl_info & ~XLR_INFO_MASK) : 0;
 
 	if (record == NULL)
 	{
@@ -7852,8 +7967,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		}
 		return NULL;
 	}
-	if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-		record->xl_info != XLOG_CHECKPOINT_ONLINE)
+	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+		info != XLOG_CHECKPOINT_ONLINE)
 	{
 		switch (whichChkpt)
 		{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..c54ca75 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -513,6 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +559,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current rmid, log a full-page write for the current block.
+		 */
+		include_image = needs_backup || wal_consistency[rmid];
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -680,7 +689,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
@@ -756,6 +765,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
+	/*
+	 * Enforce consistency checks for this record if the user has enabled
+	 * them for this resource manager.
+	 */
+	if (wal_consistency[rmid])
+		rechdr->xl_info |= XLR_CHECK_CONSISTENCY;
+
 	return &hdr_rdt;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..56d4c66 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -462,7 +462,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	if (record->xl_rmid == RM_XLOG_ID &&
+		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->EndRecPtr += XLogSegSize - 1;
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..6c67e3e
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,78 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * Mask Page LSN
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records does not currently set the flag, even though it is set on the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..5d518b6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"none",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,122 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any individual rmid? */
+	bool		isNone = false;	/* Does this guc include 'none' keyword? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array */
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual rmid */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now check whether an rm_mask function
+				 * is defined for this rmid. We enable this feature only
+				 * for the rmids for which a masking function is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Consistency checks are not supported for resource manager \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+		/* If a valid rmid is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual rmid. Check for 'none' and 'all'. */
+		if (pg_strcasecmp(tok, "none") == 0)
+		{
+			MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+			isNone = true;
+		}
+		else if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * We'll enable this feature only for the rmids for which
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* The GUC must contain either 'all', 'none', or a combination of rmids. */
+	if ((isAll && isNone) || (isAll && isRmgrId) || (isNone && isRmgrId))
+	{
+		GUC_check_errdetail("Invalid value combination.");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ed762b0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,11 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = 'none'		# Valid values are combinations of
+					# heap2, heap, btree, hash, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
+					# (change requires restart)
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..f962e79 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..41772a5 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -21,5 +21,6 @@
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hash_mask(char *page, BlockNumber blkno);
 
 #endif   /* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..5509cab 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..74d5aa0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary; note that if wal_consistency
+ * is enabled this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..874c25f
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..a5f34d3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -435,4 +435,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
+
 #endif   /* BUFPAGE_H */
#70Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#69)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 10:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

IMHO, your rewrite of this patch was a bit heavy-handed.

OK... Sorry for that.

I haven't
scrutinized the code here so maybe it was a big improvement, and if so
fine, but if not it's better to collaborate with the author than to
take over.

While reviewing the code, that has finished by being a large rewrite,
and that was more understandable than a review looking at all the
small tweaks and things I have been through while reading it. I have
also experimented a couple of ideas with the patch that I added, so at
the end it proves to be a gain for everybody. I think that the last
patch is an improvement, if you want to make your own opinion on the
matter looking at the differences between both patches would be the
most direct way to go.

If my understanding of this feature is correct, the last two patches
completely break the fundamental idea of the WAL consistency check.
As I mentioned in my last reply, we have to use some flag to indicate
whether an image should be restored during replay or not. Otherwise,
XLogReadBufferForRedoExtended will always restore the image, skipping the
usual redo operation. What's happening now is the following:
1. If wal_consistency is on, a backup block image is included with the WAL record.
2. During replay, XLogReadBufferForRedoExtended always restores the backup block
image into the local buffer, since XLogRecHasBlockImage is true for each block.
3. In checkConsistency, you compare the local buffer with the backup block image
from the WAL record. It'll always be consistent.
This feature aims to validate whether the WAL replay operation is
happening correctly or not. To achieve that aim, we should not alter the
WAL replay operation itself.
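
To make the problem concrete, here is a minimal toy sketch (not PostgreSQL code). The page size, the ToyRecord struct and the toy_* functions are all hypothetical names made up for illustration. It shows that when the image is restored before the comparison, even a buggy redo routine goes unnoticed, whereas running the real redo path first lets the comparison catch it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 8		/* hypothetical stand-in for BLCKSZ */

/* Toy WAL record: a one-byte change plus the page image from the primary. */
typedef struct ToyRecord
{
	int		offset;					/* byte the record intends to modify */
	char	value;					/* new value for that byte */
	char	image[TOY_PAGE_SIZE];	/* page as it looked after the change */
} ToyRecord;

/* A deliberately buggy redo routine: it writes one byte too far. */
void
toy_redo_buggy(char *page, const ToyRecord *rec)
{
	page[(rec->offset + 1) % TOY_PAGE_SIZE] = rec->value;
}

/*
 * Replay one record, then report whether the consistency check passes.
 * With restore_image_first set, the image overwrites the page and the
 * redo routine is bypassed, which is the situation described above.
 */
bool
toy_replay_and_check(char *page, const ToyRecord *rec, bool restore_image_first)
{
	if (restore_image_first)
		memcpy(page, rec->image, TOY_PAGE_SIZE);	/* redo is skipped */
	else
		toy_redo_buggy(page, rec);					/* regular redo path */

	/* Compare the replayed page with the image carried by the record. */
	return memcmp(page, rec->image, TOY_PAGE_SIZE) == 0;
}
```

With restore_image_first set, the check compares the image with itself and can never fail; only the second path actually exercises the redo routine.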

Rest of the suggestions are well-taken. I'll update the patch accordingly.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#70)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Wed, Nov 2, 2016 at 10:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

IMHO, your rewrite of this patch was a bit heavy-handed.

OK... Sorry for that.

I haven't
scrutinized the code here so maybe it was a big improvement, and if so
fine, but if not it's better to collaborate with the author than to
take over.

While reviewing the code, that has finished by being a large rewrite,
and that was more understandable than a review looking at all the
small tweaks and things I have been through while reading it. I have
also experimented a couple of ideas with the patch that I added, so at
the end it proves to be a gain for everybody. I think that the last
patch is an improvement, if you want to make your own opinion on the
matter looking at the differences between both patches would be the
most direct way to go.

If my understanding of this feature is correct, the last two patches
completely break the fundamental idea of the WAL consistency check.
As I mentioned in my last reply, we have to use some flag to indicate
whether an image should be restored during replay or not. Otherwise,
XLogReadBufferForRedoExtended will always restore the image, skipping the
usual redo operation. What's happening now is the following:
1. If wal_consistency is on, a backup block image is included with the WAL record.
2. During replay, XLogReadBufferForRedoExtended always restores the backup block
image into the local buffer, since XLogRecHasBlockImage is true for each block.
3. In checkConsistency, you compare the local buffer with the backup block image
from the WAL record. It'll always be consistent.
This feature aims to validate whether the WAL replay operation is
happening correctly or not. To achieve that aim, we should not alter the
WAL replay operation itself.

Hm... Right. That was broken. And actually, while the record-level
flag is useful so that you don't need to rely on checking
wal_consistency when doing WAL redo, the block-level flag is useful to
make a distinction between blocks that have to be replayed and the
ones that are used only for consistency, and both types could be mixed
in a record. Using it in bimg_info would be fine... Perhaps a better
name for the flag would be something like BKPIMAGE_APPLY, to mean that
the FPW needs to be applied at redo. Or BKPIMAGE_IGNORE, to bypass it
when replaying it. IS_REQUIRED_FOR_REDO is quite confusing.
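
As a rough sketch of how the two levels could fit together (an illustration only, not the patch's code): the record-level bit XLR_CHECK_CONSISTENCY uses the value 0x02 from the posted patch, while the block-level bit value, the ToyBlock struct and the toy_* helpers are assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLR_CHECK_CONSISTENCY	0x02	/* record-level bit, value from the patch */
#define TOY_BKPIMAGE_APPLY		0x04	/* block-level bit, value assumed here */

/* Hypothetical per-block state, loosely modeled on a decoded block reference. */
typedef struct ToyBlock
{
	bool	has_image;	/* a full-page image is attached to this block */
	uint8_t	bimg_info;	/* block-level flags */
} ToyBlock;

/* Redo restores the image only when the block-level flag asks for it. */
bool
toy_image_is_applied(const ToyBlock *blk)
{
	return blk->has_image && (blk->bimg_info & TOY_BKPIMAGE_APPLY) != 0;
}

/*
 * An image is used only for the consistency check when the record asks
 * for checks and the image is not one that redo has to restore.
 */
bool
toy_image_is_check_only(uint8_t xl_info, const ToyBlock *blk)
{
	return (xl_info & XLR_CHECK_CONSISTENCY) != 0 &&
		blk->has_image &&
		(blk->bimg_info & TOY_BKPIMAGE_APPLY) == 0;
}
```

Since both kinds of blocks can appear in one record, the decision has to be made per block, which is why a record-level bit alone would not be enough.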
--
Michael


#72Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#71)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 1:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Hm... Right. That was broken. And actually, while the record-level
flag is useful so that you don't need to rely on checking
wal_consistency when doing WAL redo, the block-level flag is useful to
make a distinction between blocks that have to be replayed and the
ones that are used only for consistency, and both types could be mixed
in a record. Using it in bimg_info would be fine... Perhaps a better
name for the flag would be something like BKPIMAGE_APPLY, to mean that
the FPW needs to be applied at redo. Or BKPIMAGE_IGNORE, to bypass it
when replaying it. IS_REQUIRED_FOR_REDO is quite confusing.

BKPIMAGE_APPLY seems reasonable.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#73Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#72)
1 attachment(s)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 1:11 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Rest of the suggestions are well-taken. I'll update the patch accordingly.

I've updated the last submitted patch (v10) with the following changes:
- added a block-level flag, BKPIMAGE_APPLY, to distinguish the backup image
blocks which need to be restored during replay.
- at present, hash index operations are not WAL-logged. Hence, I've removed
the consistency check option for hash indexes. It can be added later.

A few comments:
- Michael suggested using an integer variable and bitwise-shift operations
to store the RMGR values instead of a boolean array. But the boolean array
implementation looks cleaner to me. For example,
+if (wal_consistency[rmid])
+       rechdr->xl_info |= XLR_CHECK_CONSISTENCY;

+include_image = needs_backup || wal_consistency[rmid];

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer from it. Although I have no
experience in this matter, I would like to be certain that nothing breaks
in the recovery tests after some modifications.
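
For comparison, the two representations from the first comment can be sketched side by side. This is a standalone toy, not code from the patch: TOY_RM_MAX_ID and all toy_* names are hypothetical, and the value 21 simply assumes the 22 resource managers listed in rmgrlist.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RM_MAX_ID 21	/* hypothetical: 22 resource managers, ids 0..21 */

/* Variant 1: one bool per resource manager, as in the patch. */
bool		toy_consistency_array[TOY_RM_MAX_ID + 1];

/* Variant 2: one bit per resource manager, as suggested in review. */
uint32_t	toy_consistency_bits;

/* Enable checks for one resource manager in both representations. */
void
toy_enable(int rmid)
{
	toy_consistency_array[rmid] = true;
	toy_consistency_bits |= (UINT32_C(1) << rmid);
}

/* Lookup with the boolean array: a plain index. */
bool
toy_enabled_array(int rmid)
{
	return toy_consistency_array[rmid];
}

/* Lookup with the bitmask: shift and test. */
bool
toy_enabled_bits(int rmid)
{
	return (toy_consistency_bits & (UINT32_C(1) << rmid)) != 0;
}
```

Both variants answer the same question; the array reads more directly at the call site, while the bitmask fits in a single integer but only works as long as the number of resource managers stays within the width of that integer.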

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v11.patch (application/x-download)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..ccf6409 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,38 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..f8604db 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,31 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..f7abb9c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3d8e5d3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..72a43a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,65 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	/*
+	 * Mask the page LSN. The image is stored before the LSN is updated,
+	 * so the LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..400df0d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,58 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	/*
+	 * Mask the page LSN. The image is stored before the LSN is updated,
+	 * so the LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..47c3467 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	/*
+	 * Mask the page LSN. The image is stored before the LSN is updated,
+	 * so the LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6b1f24e..c949580 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Aligned Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -903,8 +910,9 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
+	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   rechdr->xl_info == XLOG_SWITCH);
+							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 
@@ -1261,6 +1269,89 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This is possible for
+	 * resource managers that generate only full-page writes; comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6148,6 +6239,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks, those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6948,6 +7046,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7478,6 +7585,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -7785,6 +7898,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report)
 {
 	XLogRecord *record;
+	uint8		info;
 
 	if (!XRecOffIsValid(RecPtr))
 	{
@@ -7810,6 +7924,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	}
 
 	record = ReadRecord(xlogreader, RecPtr, LOG, true);
+	info = record->xl_info & ~XLR_INFO_MASK;
 
 	if (record == NULL)
 	{
@@ -7852,8 +7967,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		}
 		return NULL;
 	}
-	if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-		record->xl_info != XLOG_CHECKPOINT_ONLINE)
+	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+		info != XLOG_CHECKPOINT_ONLINE)
 	{
 		switch (whichChkpt)
 		{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..7c4684d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -513,6 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +559,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current block.
+		 */
+		include_image = needs_backup || wal_consistency[rmid];
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +627,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * Remember that, if WAL consistency check is enabled for the current rmid,
+			 * we always include backup image with the WAL record. But, during redo we
+			 * restore the backup block only if needs_backup is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +698,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
@@ -756,6 +774,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
+	/*
+	 * Enforce consistency checks for this record if user is looking for
+	 * it.
+	 */
+	if (wal_consistency[rmid])
+		rechdr->xl_info |= XLR_CHECK_CONSISTENCY;
+
 	return &hdr_rdt;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..56d4c66 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -462,7 +462,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	if (record->xl_rmid == RM_XLOG_ID &&
+		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->EndRecPtr += XLogSegSize - 1;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..09a7722 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -352,8 +352,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If full-page image should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..6c67e3e
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,78 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * Mask Page LSN
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %u pd_upper %u pd_special %u\n",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..3525b04 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,118 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any
+							* individual resource manager? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array*/
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if maskPage function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * We'll enable this feature only for the resource managers for which
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* guc should contain either 'all' or combination of resource managers. */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ca734fe 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = ''			# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated to the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..16c785f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,9 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	(XLogRecHasBlockImage(decoder, block_id) && \
+	(((decoder)->blocks[block_id].bimg_info & BKPIMAGE_APPLY) > 0))
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..9e8ff3f 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary, note that if wal_consistency
+ * is enabled this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..874c25f
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
#74Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#73)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 11:30 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Wed, Nov 2, 2016 at 1:11 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Rest of the suggestions are well-taken. I'll update the patch accordingly.

I've updated the last submitted patch (v10) with the following changes:
- added a block level flag BKPIMAGE_APPLY to distinguish backup image
blocks which need to be restored during replay.
- at present, hash index operations are not WAL-logged. Hence, I've removed
the consistency check option for hash indices. It can be added later.

Both make sense.
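
For readers following along, here is a minimal sketch of the insert-side
decision as the quoted patch describes it: an image is included when a
backup is needed or the check is enabled, but it is marked for restore only
when a backup is needed. The flag values are copied from the patch; the
struct and function names (SketchImageDecision, sketch_assemble) are
hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BKPIMAGE_APPLY         0x04  /* value taken from the patch */
#define XLR_CHECK_CONSISTENCY  0x02  /* value taken from the patch */

/* Hypothetical summary of what gets written for one registered block. */
typedef struct SketchImageDecision
{
    bool    include_image;  /* a page image is attached to the record */
    uint8_t bimg_info;      /* flags stored in the image header */
    uint8_t xl_info;        /* flags stored in the record header */
} SketchImageDecision;

static SketchImageDecision
sketch_assemble(bool needs_backup, bool consistency_enabled)
{
    SketchImageDecision d = {false, 0, 0};

    /* Attach an image for a normal full-page write or for the check. */
    d.include_image = needs_backup || consistency_enabled;

    /* Only a real full-page write has to be restored at replay. */
    if (d.include_image && needs_backup)
        d.bimg_info |= BKPIMAGE_APPLY;

    /* Ask the startup process to compare pages after redo. */
    if (consistency_enabled)
        d.xl_info |= XLR_CHECK_CONSISTENCY;

    /* Invariant: the apply flag never appears without an image. */
    assert(!(d.bimg_info & BKPIMAGE_APPLY) || d.include_image);
    return d;
}
```

With the check enabled and no backup needed, the record carries an image
that replay compares against but does not restore.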

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer from it. Although I've no
experience in this matter, I would like to be certain that nothing breaks
in recovery tests after some modifications.

An extra idea that I have here would be to extend the TAP tests to
accept an environment variable that would be used to specify extra
options when starting Postgres instances. Buildfarm machines could use
it.

+            /*
+             * Remember that, if WAL consistency check is enabled for the current rmid,
+             * we always include backup image with the WAL record. But, during redo we
+             * restore the backup block only if needs_backup is set.
+             */
+            if (needs_backup)
+                bimg.bimg_info |= BKPIMAGE_APPLY;
+
+
You should be careful about extra newlines and noise in the code.

-    /* If it's a full-page image, restore it. */
-    if (XLogRecHasBlockImage(record, block_id))
+    /* If full-page image should be restored, do it. */
+    if (XLogRecBlockImageApply(record, block_id))
Hm. It seems to me that this modification is incorrect. If the page
does not need to be applied, aka if it needs to be used for
consistency checks, what should be done is more something like the
following in XLogReadBufferForRedoExtended:
if (Apply(record, block_id))
    return;
if (HasImage)
{
    //do what needs to be done with an image
}
else
{
    //do default things
}

XLogRecBlockImageApply() should only check for BKPIMAGE_APPLY and not imply
HasImage(). This will be more flexible when for example using it for
assertions.

With this patch the comments on top of XLogReadBufferForRedo are
wrong. A block is not unconditionally applied.

+#define XLogRecBlockImageApply(decoder, block_id) \
+    (XLogRecHasBlockImage(decoder, block_id) && \
+    (((decoder)->blocks[block_id].bimg_info & BKPIMAGE_APPLY) > 0))
Maybe != 0? That's the most common practice in the code.

It would be more consistent to have a boolean flag to treat
BKPIMAGE_APPLY in xlogreader.c. Have a look at has_image for example.
This would also reduce dependencies on the header xlogrecord.h.

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */
+            if (HeapTupleHeaderIsSpeculative(page_htup))
+                ItemPointerSet(&page_htup->t_ctid, blkno, off);
In the set of masking functions this is the only portion of the code
depending on blkno. But isn't that actually a bug in speculative
inserts? Andres (added now in CC), Peter, could you provide some input
regarding that? The masking functions should not prevent the detection
of future errors, and this code is making me uneasy.
-- 
Michael


#75Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#74)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer on it. Although I've no
experience in this matter, I would like to be certain that nothing breaks
in recovery tests after some modifications.

An extra idea that I have here would be to extend the TAP tests to
accept an environment variable that would be used to specify extra
options when starting Postgres instances. Buildfarm machines could use
it.

It can be added as a separate feature.

-    /* If it's a full-page image, restore it. */
-    if (XLogRecHasBlockImage(record, block_id))
+    /* If full-page image should be restored, do it. */
+    if (XLogRecBlockImageApply(record, block_id))
Hm. It seems to me that this modification is incorrect. If the page
does not need to be applied, aka if it needs to be used for
consistency checks, what should be done is more something like the
following in XLogReadBufferForRedoExtended:
if (Apply(record, block_id))
return;
if (HasImage)
{
//do what needs to be done with an image
}
else
{
//do default things
}

XLogReadBufferForRedoExtended should return a redo action
(block restored, done, block needs redo or block not found). So, we
can't just return from the function if the BKPIMAGE_APPLY flag is not
set. It still has to check whether a redo is required or not.

XLogRecBlockImageApply() should only check for BKP_APPLY and not imply
HasImage(). This will be more flexible when for example using it for
assertions.

seems reasonable.

It would be more consistent to have a boolean flag to treat
BKPIMAGE_APPLY in xlogreader.c. Have a look at has_image for example.

For flags in bimg_info, we directly check if the mask bit is set in bimg_info
(ex: hole, compressed). Besides, we use this flag only at
XLogReadBufferForRedoExtended.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#76Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#75)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 3:24 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer on it. Although I've no
experience in this matter, I would like to be certain that nothing breaks
in recovery tests after some modifications.

An extra idea that I have here would be to extend the TAP tests to
accept an environment variable that would be used to specify extra
options when starting Postgres instances. Buildfarm machines could use
it.

It can be added as a separate feature.

-    /* If it's a full-page image, restore it. */
-    if (XLogRecHasBlockImage(record, block_id))
+    /* If full-page image should be restored, do it. */
+    if (XLogRecBlockImageApply(record, block_id))
Hm. It seems to me that this modification is incorrect. If the page
does not need to be applied, aka if it needs to be used for
consistency checks, what should be done is more something like the
following in XLogReadBufferForRedoExtended:
if (Apply(record, block_id))
return;
if (HasImage)
{
//do what needs to be done with an image
}
else
{
//do default things
}

XLogReadBufferForRedoExtended should return a redo action
(block restored, done, block needs redo or block not found). So, we
can't just return from the function if the BKPIMAGE_APPLY flag is not
set. It still has to check whether a redo is required or not.

Wouldn't the definition of a new redo action make sense then? Say
SKIPPED. None of the existing actions match the non-apply case.
--
Michael


#77Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#76)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 12:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 3, 2016 at 3:24 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier

-    /* If it's a full-page image, restore it. */
-    if (XLogRecHasBlockImage(record, block_id))
+    /* If full-page image should be restored, do it. */
+    if (XLogRecBlockImageApply(record, block_id))
Hm. It seems to me that this modification is incorrect. If the page
does not need to be applied, aka if it needs to be used for
consistency checks, what should be done is more something like the
following in XLogReadBufferForRedoExtended:
if (Apply(record, block_id))
return;
if (HasImage)
{
//do what needs to be done with an image
}
else
{
//do default things
}

XLogReadBufferForRedoExtended should return a redo action
(block restored, done, block needs redo or block not found). So, we
can't just return from the function if the BKPIMAGE_APPLY flag is not
set. It still has to check whether a redo is required or not.

Wouldn't the definition of a new redo action make sense then? Say
SKIPPED. None of the existing actions match the non-apply case.

As per my understanding, XLogReadBufferForRedoExtended works as follows:
1.  If wal record has backup block
2.  {
3.      restore the backup block;
4.      return BLK_RESTORED;
5.  }
6.  else
7.  {
8.      If block found in buffer
9.          If lsn of block is not older than lsn of the record being replayed
10.             return BLK_DONE;
11.         else
12.             return BLK_NEEDS_REDO;
13.     else
14.         return BLK_NOTFOUND;
15. }
Now, we can just change step 1 as follows:
1.  If wal record has backup block and it needs to be restored.

I'm not getting why we should introduce a new redo action and return
from the function beforehand.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#78Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#76)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 4:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Wouldn't the definition of a new redo action make sense then? Say
SKIPPED. None of the existing actions match the non-apply case.

I just took 5 minutes to look at the code and reason about it, and
something like what your patch is doing would be actually fine. Still
I don't think that checking for the apply flag in the macro routine
should look for has_image. Let's keep things separate.
--
Michael


#79Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#78)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 2:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 3, 2016 at 4:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Wouldn't the definition of a new redo action make sense then? Say
SKIPPED. None of the existing actions match the non-apply case.

I just took 5 minutes to look at the code and reason about it, and
something like what your patch is doing would be actually fine. Still
I don't think that checking for the apply flag in the macro routine
should look for has_image. Let's keep things separate.

Actually, I just verified that bimg_info is not even valid if
has_image is not set. In DecodeXLogRecord, we initialize bimg_info only
when the has_image flag is set. So, keeping them separate doesn't look
like a good approach to me. If we keep them separate, the output of the
following assert is undefined:
Assert(XLogRecHasBlockImage(record, block_id) ||
       !XLogRecBlockImageApply(record, block_id)).

Thoughts??
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#80Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#77)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 5:56 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

I'm not getting why we should introduce a new redo action and return
from the function beforehand.

Per my last email, same conclusion from here :)
Sorry if I am picky and noisy on many points, I am trying to think
about the value of each change introduced in this patch, particularly
whether they are meaningful, can be improved in some way, or can be
simplified to make the code simpler.
--
Michael


#81Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#79)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 6:15 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Actually, I just verified that bimg_info is not even valid if
has_image is not set. In DecodeXLogRecord, we initialize bimg_info only
when the has_image flag is set. So, keeping them separate doesn't look
like a good approach to me. If we keep them separate, the output of the
following assert is undefined:
Assert(XLogRecHasBlockImage(record, block_id) ||
       !XLogRecBlockImageApply(record, block_id)).

Thoughts??

Yes, that's exactly the reason why we should keep both macros as
checking for separate things: apply implies that has_image is set and
that's normal, hence we could use sanity checks by just using those
macros and not propagating xlogreader.h.
--
Michael


#82Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#81)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 3, 2016 at 6:15 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Actually, I just verified that bimg_info is not even valid if
has_image is not set. In DecodeXLogRecord, we initialize bimg_info only
when the has_image flag is set. So, keeping them separate doesn't look
like a good approach to me. If we keep them separate, the output of the
following assert is undefined:
Assert(XLogRecHasBlockImage(record, block_id) ||
       !XLogRecBlockImageApply(record, block_id)).

Thoughts??

Yes, that's exactly the reason why we should keep both macros as
checking for separate things: apply implies that has_image is set and
that's normal, hence we could use sanity checks by just using those
macros and not propagating xlogreader.h.

No, apply doesn't mean has_image is set. If has_image is not set,
apply/bimg_info is invalid (undefined) and we should not use it. For
example, in XLogDumpDisplayRecord we use bimg_info as follows:
if (XLogRecHasBlockImage(record, block_id))
{
    if (record->blocks[block_id].bimg_info & BKPIMAGE_IS_COMPRESSED)
    {
    }
}
So, whenever we are required to use the bimg_info flag, we should make
sure that has_image is set.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#83Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#82)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 6:48 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

So, whenever we are required to use bimg_info flag, we should make
sure that has_image
is set.

OK, we are talking past each other here. There are two possible patterns:
- has_image is set, not apply, meaning that the image block is used
for consistency checks.
- has_image is set, as well as apply, meaning that the block needs to
be applied at redo.
So I mean exactly the same thing as you do. The point I am trying to
raise is that it would be meaningful to put in some code paths checks
of the type (apply && !has_image) and ERROR on them. Perhaps we could
just do that in xlogreader.c though. If having those checks external
to xlogreader.c makes sense, then using separate macros is more
portable.
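
As a rough illustration of the two valid patterns and of the (apply && !has_image) sanity check, consider the sketch below. Every name in it (`SketchImageUse`, `sketch_classify_image`, `sketch_assert_image_flags`) is hypothetical and exists only for this example; real code would presumably raise an ERROR rather than return a marker value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of how a block's image is meant to be used. */
typedef enum SketchImageUse
{
    SKETCH_NO_IMAGE,            /* nothing attached to the block */
    SKETCH_IMAGE_CHECK_ONLY,    /* has_image set, apply not set */
    SKETCH_IMAGE_APPLY,         /* has_image and apply both set */
    SKETCH_IMAGE_INVALID        /* apply set without an image */
} SketchImageUse;

static SketchImageUse
sketch_classify_image(bool has_image, bool apply)
{
    /* The combination the sanity check is meant to catch. */
    if (apply && !has_image)
        return SKETCH_IMAGE_INVALID;

    if (!has_image)
        return SKETCH_NO_IMAGE;

    return apply ? SKETCH_IMAGE_APPLY : SKETCH_IMAGE_CHECK_ONLY;
}

/* Debug-build style check that a caller could place in any code path. */
static void
sketch_assert_image_flags(bool has_image, bool apply)
{
    assert(sketch_classify_image(has_image, apply) != SKETCH_IMAGE_INVALID);
}
```

Because the classifier takes the two flags as independent inputs, it only makes sense if the two macros report separate things, which is the point being argued for here.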
--
Michael


#84Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#83)
1 attachment(s)
Re: WAL consistency check facility

I've updated the patch for review.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v12.patch (application/x-download)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..ccf6409 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,38 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..f8604db 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,31 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..f7abb9c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In Gist redo, we never mark a page as garbage. Hence, Mask It.*/
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3d8e5d3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..72a43a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,65 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..400df0d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,58 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..47c3467 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	/*
+	 * Mask the Page LSN. Because, we store the page before updating the LSN.
+	 * Hence, LSNs of both pages will always be different.
+	 */
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6b1f24e..eeb5850 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Aligned Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -903,8 +910,9 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
+	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   rechdr->xl_info == XLOG_SWITCH);
+							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 
@@ -1261,6 +1269,94 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking functions defined, this is possible in the case
+	 * resource managers generating just full page writes, comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		/* If we've just restored the block from backup image, skip consistency check. */
+		if (XLogRecHasBlockImage(record, block_id) &&
+			XLogRecBlockImageApply(record, block_id))
+			continue;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(LOG,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6148,6 +6244,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks, those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6948,6 +7051,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7478,6 +7590,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -7785,6 +7903,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report)
 {
 	XLogRecord *record;
+	uint8		info;
 
 	if (!XRecOffIsValid(RecPtr))
 	{
@@ -7810,6 +7929,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	}
 
 	record = ReadRecord(xlogreader, RecPtr, LOG, true);
+	info = (record != NULL) ? (record->xl_info & ~XLR_INFO_MASK) : 0;
 
 	if (record == NULL)
 	{
@@ -7852,8 +7972,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		}
 		return NULL;
 	}
-	if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-		record->xl_info != XLOG_CHECKPOINT_ONLINE)
+	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+		info != XLOG_CHECKPOINT_ONLINE)
 	{
 		switch (whichChkpt)
 		{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..a45766f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -513,6 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +559,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current block.
+		 */
+		include_image = needs_backup || wal_consistency[rmid];
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +627,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * Remember that, if WAL consistency check is enabled for the current rmid,
+			 * we always include backup image with the WAL record. But, during redo we
+			 * restore the backup block only if needs_backup is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +697,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
@@ -756,6 +773,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
+	/*
+	 * Enforce consistency checks for this record if user is looking for
+	 * it.
+	 */
+	if (wal_consistency[rmid])
+		rechdr->xl_info |= XLR_CHECK_CONSISTENCY;
+
 	return &hdr_rdt;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2da505..56d4c66 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -462,7 +462,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	if (record->xl_rmid == RM_XLOG_ID &&
+		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->EndRecPtr += XLogSegSize - 1;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..bc9c328 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,8 +275,8 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag set,
+ * we restore it, even if the page in the database appears newer.  This is
  * to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
@@ -310,9 +310,9 @@ XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id)
  * XLogReadBufferForRedoExtended
  *		Like XLogReadBufferForRedo, but with extra options.
  *
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if BKPIMAGE_APPLY flag is not set for the backup block,
+ * the relation is extended with all-zeroes pages up to the referenced block number.
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO.
  *
  * (The RBM_ZERO_AND_CLEANUP_LOCK mode is redundant with the get_cleanup_lock
@@ -352,8 +352,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecHasBlockImage(record, block_id) && XLogRecBlockImageApply(record, block_id))
 	{
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..6c67e3e
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,78 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to ensure that buffers used for
+ *	  comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * Mask Page LSN
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %u pd_upper %u pd_special %u\n",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..a4a7f57 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,118 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any
+							* individual resource manager? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array*/
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if mask function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * We'll enable this feature only for the resource managers for which
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* guc should contain either 'all' or combination of resource managers. */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ca734fe 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = ''			# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..46b85a8 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	(((decoder)->blocks[block_id].bimg_info & BKPIMAGE_APPLY) != 0)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..9e8ff3f 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary; note that if wal_consistency
+ * is enabled this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..874c25f
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
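
For readers skimming the patch, here is a minimal standalone sketch of the two flag mechanics it relies on. The constant values are the ones used in xlogrecord.h (with this patch) and pg_control.h; DemoRecord, demo_is_checkpoint() and demo_image_flags() are hypothetical names invented for illustration and are not part of the patch. The first helper shows why record types have to be compared after masking out the low flag bits once XLR_CHECK_CONSISTENCY can be set, and that the mask may only be applied after the record pointer is known to be non-NULL. The second mirrors the include_image / BKPIMAGE_APPLY decision in XLogRecordAssemble().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as in xlogrecord.h (with this patch) and pg_control.h. */
#define XLR_INFO_MASK            0x0F
#define XLR_CHECK_CONSISTENCY    0x02	/* added by the patch */
#define XLOG_CHECKPOINT_SHUTDOWN 0x00
#define XLOG_CHECKPOINT_ONLINE   0x10
#define BKPIMAGE_APPLY           0x04	/* added by the patch */

/* Hypothetical stand-in for XLogRecord; only xl_info matters here. */
typedef struct DemoRecord
{
	uint8_t		xl_info;
} DemoRecord;

/*
 * Hypothetical helper: is this a checkpoint record?  The low four bits of
 * xl_info carry flags such as XLR_CHECK_CONSISTENCY, so they are masked
 * out before comparing.  The NULL test comes first, because masking
 * dereferences the record.
 */
bool
demo_is_checkpoint(const DemoRecord *record)
{
	uint8_t		info;

	if (record == NULL)
		return false;

	info = record->xl_info & ~XLR_INFO_MASK;
	return info == XLOG_CHECKPOINT_SHUTDOWN ||
		info == XLOG_CHECKPOINT_ONLINE;
}

/*
 * Hypothetical helper mirroring the decision in XLogRecordAssemble():
 * an image is included when a backup is needed or when the consistency
 * check is enabled, but it is marked for restore only when a backup is
 * actually needed.  Returns the bimg_info bits relevant to this sketch.
 */
uint8_t
demo_image_flags(bool needs_backup, bool check_enabled, bool *include_image)
{
	uint8_t		bimg_info = 0;

	assert(include_image != NULL);

	*include_image = needs_backup || check_enabled;
	if (needs_backup)
		bimg_info |= BKPIMAGE_APPLY;

	return bimg_info;
}
```

With the flag bit set, a plain equality test on xl_info would no longer recognize an online checkpoint record, which is why the patch switches those comparisons to the masked value.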
#85Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#84)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 7:47 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

I've updated the patch for review.

If an inconsistency is found, it is only logged for now. Once the
patch is finalized, we can change it back to FATAL as before. I wanted
to make sure that all regression tests pass with the patch. It seems
that there is an inconsistency in the regression tests for the BRIN
index:

LOG: Inconsistent page found, rel 1663/16384/30607, forknum 0, blkno 1
CONTEXT: xlog redo at 0/9BAE08C8 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1
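
For context, the comparison behind this message can be sketched roughly as follows. This is a simplified standalone illustration, not the code from the patch: DemoPageHeader is a hypothetical, cut-down page header (the real PageHeaderData has more fields), and demo_mask_page() and demo_pages_consistent() are made-up names. It only masks the LSN and the unused space between lower and upper, similar in spirit to mask_page_lsn() and mask_unused_space(); the actual rm_mask routines also mask hint bits and rmgr-specific fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ      8192
#define DEMO_MASK_MARKER 0xFF

/* Hypothetical simplified page header, laid out without padding. */
typedef struct DemoPageHeader
{
	uint64_t	lsn;		/* differs between image and replayed page */
	uint16_t	lower;		/* start of unused space */
	uint16_t	upper;		/* end of unused space */
	uint16_t	special;	/* start of special area */
	uint16_t	pad;		/* keeps the struct free of padding bytes */
} DemoPageHeader;

/* Overwrite the parts of a page copy that may legitimately differ. */
void
demo_mask_page(unsigned char *page)
{
	DemoPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));

	/* Sanity check, in the same spirit as mask_unused_space(). */
	assert(hdr.lower >= sizeof(hdr) && hdr.lower <= hdr.upper &&
		   hdr.upper <= hdr.special && hdr.special <= DEMO_BLCKSZ);

	hdr.lsn = UINT64_MAX;
	memcpy(page, &hdr, sizeof(hdr));
	memset(page + hdr.lower, DEMO_MASK_MARKER, hdr.upper - hdr.lower);
}

/* Compare a replayed page with the image from WAL after masking both. */
bool
demo_pages_consistent(const unsigned char *replayed,
					  const unsigned char *image)
{
	unsigned char a[DEMO_BLCKSZ];
	unsigned char b[DEMO_BLCKSZ];

	/* Work on copies so neither input page is modified. */
	memcpy(a, replayed, DEMO_BLCKSZ);
	memcpy(b, image, DEMO_BLCKSZ);
	demo_mask_page(a);
	demo_mask_page(b);

	return memcmp(a, b, DEMO_BLCKSZ) == 0;
}
```

A difference that survives masking, as in the BRIN case above, means either a real replay bug or a field that the mask function for that resource manager does not yet cover.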

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#73)
Re: WAL consistency check facility

On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer from it. Although I have no
experience in this matter, I would like to be certain that nothing breaks
in the recovery tests after some modifications.

I think running the whole test suite with this enabled is going to
provoke complaints from buildfarm owners. That's too bad, because I
agree with you that it would be nice to have the test coverage, but it
seems that many of the buildfarm machines are VMs with very minimal
resource allocations -- or very old physical machines -- or running
with settings like CLOBBER_CACHE_ALWAYS that make runs very slow. If
you blow on them too hard, they fall over.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#87Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#86)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer from it. Although I have no
experience in this matter, I would like to be certain that nothing breaks
in the recovery tests after some modifications.

I think running the whole test suite with this enabled is going to
provoke complaints from buildfarm owners. That's too bad, because I
agree with you that it would be nice to have the test coverage, but it
seems that many of the buildfarm machines are VMs with very minimal
resource allocations -- or very old physical machines -- or running
with settings like CLOBBER_CACHE_ALWAYS that make runs very slow. If
you blow on them too hard, they fall over.

Thanks, Robert. I get your point. In that case, as Michael suggested, it
would be nice to have an environment variable for passing optional
configuration parameters to the TAP tests.
Implementing that would solve the problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#88Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#87)
Re: WAL consistency check facility

On Fri, Nov 4, 2016 at 4:16 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Thu, Nov 3, 2016 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer on it. Although I've no
experience in this matter, I would like to be certain that nothing breaks
in recovery tests after some modifications.

I think running the whole test suite with this enabled is going to
provoke complaints from buildfarm owners. That's too bad, because I
agree with you that it would be nice to have the test coverage, but it
seems that many of the buildfarm machines are VMs with very minimal
resource allocations -- or very old physical machines -- or running
with settings like CLOBBER_CACHE_ALWAYS that make runs very slow. If
you blow on them too hard, they fall over.

Count me in. My RPIs won't like that! Actually I have a couple of
things internally mimicking the buildfarm client code on machines with
far higher capacity. And FWIW I am definitely going to enable this
option in the test suite, finishing with reports here.

Thanks Robert. I got your point. Then, as Michael has suggested, it would be
nice to have some environment variable to pass optional conf parameters during
the TAP tests. Implementing that would solve the problem.

We just need to make PostgresNode.pm aware of something like PGTAPOPTIONS,
so that a server started by the TAP tests gets those options appended to
its configuration. There is already PGCTLTIMEOUT that behaves similarly.
Even if this brings extra load to buildfarm owners, that will limit complaints.
--
Michael

#89Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#84)
Re: WAL consistency check facility

On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've updated the patch for review.

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

+       </para>
+      </listitem>
+     </varlistentry>
+
+      </listitem>
+     </varlistentry>
Did you try to compile the docs? Because that will break. (Likely my
fault). What needs to be done is removing one </listitem> and one
</varlistentry> markup.
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
We could likely survive here with just a copyright mention as 2016,
PGDG... Same remark for bufmask.c.
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
This header declaration is not necessary.
+   /*
+    * Mask the Page LSN. Because, we store the page before updating the LSN.
+    * Hence, LSNs of both pages will always be different.
+    */
+   mask_page_lsn(page_norm);
I don't think this comment works as phrased. I do understand it, but
people reading this code for the first time may have a hard time with
it. So I would suggest removing it and instead adding a comment on top
of mask_page_lsn() mentioning that in consistency checks the LSNs of the
two pages compared will likely differ, because of concurrent operations
between the time the WAL is generated and the state of the page where
the WAL is applied.
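
The masking idea discussed here can be illustrated with a small standalone sketch. This is not code from the patch: the `toy_` names, `TOY_BLCKSZ` and `TOY_MASK_MARKER` are hypothetical stand-ins, and the only layout fact relied on is that the page LSN occupies the first 8 bytes of the page header. Both pages get their LSN overwritten with the same marker, so a difference in LSN alone never counts as an inconsistency.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLCKSZ      8192    /* stand-in for BLCKSZ */
#define TOY_MASK_MARKER 0xFF    /* arbitrary marker, stand-in for MASK_MARKER */
#define TOY_LSN_SIZE    8       /* pd_lsn is the first 8 bytes of the header */

/* Overwrite the LSN field so it never takes part in the comparison. */
static void
toy_mask_page_lsn(unsigned char *page)
{
	memset(page, TOY_MASK_MARKER, TOY_LSN_SIZE);
}

/* Return 1 if the two pages match once their LSNs are masked out. */
static int
toy_pages_consistent(const unsigned char *replayed,
					 const unsigned char *image)
{
	static unsigned char a[TOY_BLCKSZ];
	static unsigned char b[TOY_BLCKSZ];

	/* Work on copies so the caller's pages are left untouched. */
	memcpy(a, replayed, TOY_BLCKSZ);
	memcpy(b, image, TOY_BLCKSZ);

	toy_mask_page_lsn(a);
	toy_mask_page_lsn(b);

	return memcmp(a, b, TOY_BLCKSZ) == 0;
}
```

Two pages that differ only in their first 8 bytes compare as consistent here, while any difference elsewhere is still reported.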
+   maskopaq = (BTPageOpaque)
+           (((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+   /*
+    * Mask everything on a DELETED page.
+    */
Let's make the code breathe and add a space here.
+/* Aligned Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
palloc'd buffers are aligned, so you could just remove the word
"Aligned" in this comment?
+       /* If we've just restored the block from backup image, skip consistency check. */
+       if (XLogRecHasBlockImage(record, block_id) &&
+           XLogRecBlockImageApply(record, block_id))
+           continue;
Here you could just check for Apply() to decide if continue should be
called or not, and Assert afterwards on HasBlockImage(). The assertion
would help checking for inconsistency errors.
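
A minimal sketch of the control flow suggested above, written as standalone C so it can be compiled on its own. `ToyBlock` and `toy_block_needs_check()` are hypothetical stand-ins for the per-block state behind XLogRecHasBlockImage() and XLogRecBlockImageApply(); they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-block state kept by the WAL reader. */
typedef struct ToyBlock
{
	bool		has_image;		/* a full-page image is stored in the record */
	bool		apply_image;	/* redo restores the page from that image */
} ToyBlock;

/* Return true if the block still has to be compared against its image. */
static bool
toy_block_needs_check(const ToyBlock *blk)
{
	/* The page was just restored from the image: nothing to compare. */
	if (blk->apply_image)
		return false;

	/*
	 * A record flagged for consistency checking is expected to carry an
	 * image for every block it references, so anything else is a bug.
	 */
	assert(blk->has_image);

	return true;
}
```

Testing only the apply flag keeps the skip condition simple, and the assertion then catches records that were flagged for checking but arrived without an image.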

@@ -7810,6 +7929,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
    }

    record = ReadRecord(xlogreader, RecPtr, LOG, true);
+   info = record->xl_info & ~XLR_INFO_MASK;

    if (record == NULL)
    {
@@ -7852,8 +7972,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
        }
        return NULL;
    }
-   if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-       record->xl_info != XLOG_CHECKPOINT_ONLINE)
+   if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+       info != XLOG_CHECKPOINT_ONLINE)
Those changes are not directly related to this patch, but they make
sure that record checks are done correctly; without them this patch
would just fail. It may be better to apply those changes independently
first, per the patch on this thread:
https://www.postgresql.org/message-id/CAB7nPqSWVyaZJg7_amRKVqRpEP=_=54e+762+2PF9u3Q3+Z0Nw@mail.gmail.com
My recommendation is to do so.
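
To illustrate why the comparison has to use the masked value: once a flag such as XLR_CHECK_CONSISTENCY may be set in the low bits of xl_info, comparing the raw byte against the checkpoint record types no longer matches. The sketch below is standalone C, not code from the patch. The `TOY_` constants are stand-ins, and their numeric values are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in constants; the numeric values are assumptions for illustration. */
#define TOY_XLR_INFO_MASK				0x0F	/* low bits reserved for flags */
#define TOY_XLR_CHECK_CONSISTENCY		0x02	/* hypothetical flag bit */
#define TOY_XLOG_CHECKPOINT_SHUTDOWN	0x00
#define TOY_XLOG_CHECKPOINT_ONLINE		0x10

/* Old-style test: compares the raw byte, flag bits included. */
static bool
toy_is_checkpoint_raw(uint8_t xl_info)
{
	return xl_info == TOY_XLOG_CHECKPOINT_SHUTDOWN ||
		xl_info == TOY_XLOG_CHECKPOINT_ONLINE;
}

/* New-style test: strips the flag bits before comparing. */
static bool
toy_is_checkpoint_masked(uint8_t xl_info)
{
	uint8_t		info = (uint8_t) (xl_info & ~TOY_XLR_INFO_MASK);

	return info == TOY_XLOG_CHECKPOINT_SHUTDOWN ||
		info == TOY_XLOG_CHECKPOINT_ONLINE;
}
```

With the flag bit set on an online checkpoint record, the raw test rejects the record while the masked test still recognizes it.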
+           /*
+            * Remember that, if WAL consistency check is enabled for the current rmid,
+            * we always include backup image with the WAL record. But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
This could be rewritten a bit:
"If WAL consistency is enabled for the resource manager of this WAL
record, a full-page image is included in the record for the block
modified. During redo, the full-page image is replayed only if
BKPIMAGE_APPLY is set."
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if BKPIMAGE_APPLY flag is not set for the backup block,
+ * the relation is extended with all-zeroes pages up to the referenced block number.
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO
You are forgetting to mention "if the page does not exist" in the new
comment block.
+   /* If it has a full-page image and it should be restored, do it. */
+   if (XLogRecHasBlockImage(record, block_id) && XLogRecBlockImageApply(record, block_id))
    {
Perhaps on two lines?

The headers of the functions in bufmask.c could be more descriptive:
there should be explanations of what each function is useful for, to
guide callers in using them wisely (this is linked to my comment
upthread about the badly formulated comment before the call to
mask_page_lsn).

Something regarding check_wal_consistency is making me uneasy... But I
can't put my finger on what that is now.

I would still argue for the removal of blkno in the list of arguments of the
masking functions. This is used just for speculative inserts, where we
could just enforce the page number to 0 because this does not matter,
as Peter has mentioned upthread.

Would it be possible to add in pg_xlogdump.c a mention of a FPW
that has the "apply" flag? That would be important for debugging and
development. You could just have for example "(FPW)" for a page that
won't be applied, and "(FPW) apply" for a page where the apply flag is
active.

Please update gindesc.c for FPWs that have the apply flag, issue found
while checking the callers of XLogRecHasBlockImage().

In DecodeXLogRecord@xlogreader.c, please add a boolean flag "apply"
and then please could you do some error checks on it. Only one is
needed: if "apply" is true and has_image is false, xlogreader.c should
complain about an inconsistency!
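
The single sanity check requested here could look roughly like the following standalone sketch. `ToyDecodedBlock` and `toy_validate_block()` are hypothetical names used only for illustration; the real decoding code and its error reporting are not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the decoded per-block flags. */
typedef struct ToyDecodedBlock
{
	bool		has_image;		/* the record stores a full-page image */
	bool		apply_image;	/* the image must be restored during redo */
} ToyDecodedBlock;

/*
 * Return NULL if the flags are coherent, or a message describing the
 * problem. An image that must be applied but is not present is the only
 * combination treated as invalid.
 */
static const char *
toy_validate_block(const ToyDecodedBlock *blk)
{
	if (blk->apply_image && !blk->has_image)
		return "apply flag is set but no full-page image is present";

	return NULL;
}
```

The other three combinations are all legitimate: no image at all, an image kept only for checking, and an image that is both present and applied.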

I haven't performed any tests with the patch, and that's all I have
regarding the code. With that done we should be in good shape
code-speaking I think...
--
Michael

#90Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#89)
Re: WAL consistency check facility

On Fri, Nov 4, 2016 at 5:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've updated the patch for review.

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

One last thing: in XLogRecordAssemble(), could you enforce the value
of info at the beginning of the routine when wal_consistency[rmid] is
true? And then use the value of info to decide if include_image is
true or not? The idea here is to allow callers of XLogInsert() to pass
XLR_CHECK_CONSISTENCY themselves and still have consistency checks
enabled for a given record even if wal_consistency is false for the
rmgr of that record. This would be potentially useful for
extension and feature developers when debugging some stuff, for some
builds compiled with a DEBUG flag, or whatever.
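
The decision flow proposed above can be sketched in standalone C as follows. This is an illustration, not the patch itself: `toy_assemble()`, `ToyImageDecision`, the `toy_wal_consistency` array and the numeric flag values are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_XLR_CHECK_CONSISTENCY	0x02	/* assumed value */
#define TOY_BKPIMAGE_APPLY			0x04	/* assumed value */
#define TOY_RM_MAX					4

/* Per-resource-manager switch, all false by default. */
static bool toy_wal_consistency[TOY_RM_MAX];

typedef struct ToyImageDecision
{
	uint8_t		info;			/* info bits after enforcement */
	bool		include_image;	/* store a full-page image in the record */
	uint8_t		bimg_info;		/* image flags, APPLY only if really needed */
} ToyImageDecision;

static ToyImageDecision
toy_assemble(int rmid, uint8_t info, bool needs_backup)
{
	ToyImageDecision d;

	/* Enforce the flag up front when the setting covers this rmgr. */
	if (toy_wal_consistency[rmid])
		info |= TOY_XLR_CHECK_CONSISTENCY;

	/* From here on, only info decides; a caller-set flag works the same. */
	d.info = info;
	d.include_image = needs_backup ||
		(info & TOY_XLR_CHECK_CONSISTENCY) != 0;

	/* The image is replayed at redo only when it was genuinely needed. */
	d.bimg_info = needs_backup ? TOY_BKPIMAGE_APPLY : 0;

	return d;
}
```

Because the setting is folded into info first, a caller that passes the flag explicitly gets an image for checking even when the setting is off for that resource manager.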
--
Michael

#91Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#90)
Re: WAL consistency check facility

On Fri, Nov 4, 2016 at 6:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Nov 4, 2016 at 5:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've updated the patch for review.

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

One last thing: in XLogRecordAssemble(), could you enforce the value
of info at the beginning of the routine when wal_consistency[rmid] is
true? And then use the value of info to decide if include_image is
true or not? The idea here is to allow callers of XLogInsert() to pass
XLR_CHECK_CONSISTENCY themselves and still have consistency checks
enabled for a given record even if wal_consistency is false for the
rmgr of that record. This would be potentially useful for
extension and feature developers when debugging some stuff, for some
builds compiled with a DEBUG flag, or whatever.

And you need to rebase the patch, d5f6f13 has fixed the handling of
xl_info with XLR_INFO_MASK.
--
Michael

#92Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#91)
1 attachment(s)
Re: WAL consistency check facility

On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

Thanks a lot for reviewing the patch. Based on your review, I've attached the
updated patch along with a few comments.

In DecodeXLogRecord@xlogreader.c, please add a boolean flag "apply"
and then please could you do some error checks on it. Only one is
needed: if "apply" is true and has_image is false, xlogreader.c should
complain about an inconsistency!

Added a flag named apply_image in DecodedBkpBlock and used it to
check whether image apply is required or not.

I would still argue for the removal of blkno in the list of arguments of the
masking functions. This is used just for speculative inserts, where we
could just enforce the page number to 0 because this does not matter,
as Peter has mentioned upthread.

It just doesn't feel right to me to enforce the number manually when
I can use the blkno without any harm.

I haven't performed any tests with the patch, and that's all I have
regarding the code. With that done we should be in good shape
code-speaking I think...

I've done a fair amount of testing, which includes regression tests
and manual creation of inconsistencies in the page. I've also found the
reason behind the inconsistency in the BRIN revmap page.
A BRIN revmap page doesn't have the standard page layout, and it doesn't
update pd_upper and pd_lower in its operations either. But during WAL
insertion it uses REGBUF_STANDARD to register a reference in the WAL
record. Hence, when we restore the image before the consistency check,
RestoreBlockImage fills the space between pd_lower and pd_upper (the page
hole) with zeroes. I've posted this as a separate thread.
/messages/by-id/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
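
The effect described above can be reproduced with a toy model of hole removal. This is a simplified sketch, not PostgreSQL code: the page is 64 bytes, and `toy_store_image()` and `toy_restore_image()` are hypothetical helpers that mimic dropping the bytes between pd_lower and pd_upper when the image is stored and zero-filling them when it is restored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 64			/* tiny page, for illustration only */

/* Store the page without the [lower, upper) hole; return bytes written. */
static size_t
toy_store_image(const unsigned char *page, uint16_t lower, uint16_t upper,
				unsigned char *out)
{
	memcpy(out, page, lower);
	memcpy(out + lower, page + upper, TOY_BLCKSZ - upper);
	return (size_t) lower + (TOY_BLCKSZ - upper);
}

/* Rebuild the page from the stored image, zero-filling the hole. */
static void
toy_restore_image(const unsigned char *img, uint16_t lower, uint16_t upper,
				  unsigned char *page)
{
	memcpy(page, img, lower);
	memset(page + lower, 0, upper - lower);
	memcpy(page + upper, img + lower, TOY_BLCKSZ - upper);
}
```

If a page keeps real data in that range while being registered as having the standard layout, the restored image comes back with zeroes there and no longer matches the page in the buffer, which is the kind of mismatch reported for the revmap page.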

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v13.patch (application/x-download)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..57660d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..f8604db 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,31 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..f7abb9c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..1744b29 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,61 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..6aabbad 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,55 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque)
+			(((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index db832a5..66107a0 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -113,7 +113,10 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 					(ginxlogRecompressDataLeaf *) payload;
 
 					if (XLogRecHasBlockImage(record, 0))
-						appendStringInfoString(buf, " (full page image)");
+						if (XLogRecBlockImageApply(record, 0))
+							appendStringInfoString(buf, " (full page image, apply)");
+						else
+							appendStringInfoString(buf, " (full page image)");
 					else
 						desc_recompress_leaf(buf, insertData);
 				}
@@ -147,7 +150,10 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 				ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) rec;
 
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image, apply)");
+					else
+						appendStringInfoString(buf, " (full page image)");
 				else
 					desc_recompress_leaf(buf, &xlrec->data);
 			}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..f66f73a 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,19 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..3b5ddd6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1262,6 +1269,95 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This is possible for
+	 * resource managers that generate only full-page writes; comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		Assert(XLogRecHasBlockImage(record, block_id));
+
+		/* If we've just restored the block from backup image, skip consistency check. */
+		if (XLogRecBlockImageApply(record, block_id))
+			continue;
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6149,6 +6245,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks, those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6949,6 +7052,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7479,6 +7591,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..23d8ac5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -498,6 +500,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	hdr_rdt.data = hdr_scratch;
 
 	/*
+	 * Enforce consistency checks for this record if user is looking for
+	 * it.
+	 */
+	if (wal_consistency[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	/*
 	 * Make an rdata chain containing all the data portions of all block
 	 * references. This includes the data for full-page images. Also append
 	 * the headers for the block references in the scratch buffer.
@@ -513,6 +522,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image; /* Whether backup image should be included in WAL record */
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +566,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current block.
+		 */
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY);
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +634,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * If WAL consistency is enabled for the resource manager of this WAL
+			 * record, a full-page image is included in the record for the block
+			 * modified. During redo, the full-page is replayed only if
+			 * BKPIMAGE_APPLY is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +705,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 56d4c66..9ed78cf 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1089,6 +1089,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			blk = &state->blocks[block_id];
 			blk->in_use = true;
+			blk->apply_image = false;
 
 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
@@ -1120,6 +1121,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+
+				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1243,6 +1247,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
+
+		Assert(blk->has_image || !blk->apply_image);
+
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..85ff838 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,8 +275,8 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag set,
+ * we restore it, even if the page in the database appears newer.  This is
  * to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
@@ -310,9 +310,11 @@ XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id)
  * XLogReadBufferForRedoExtended
  *		Like XLogReadBufferForRedo, but with extra options.
  *
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.
+ *
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO.
  *
  * (The RBM_ZERO_AND_CLEANUP_LOCK mode is redundant with the get_cleanup_lock
@@ -352,9 +354,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
+		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..282d67a
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * In consistency checks, the LSN of the two pages compared will likely be
+ * different because of concurrent operations when the WAL is generated
+ * and the state of the page when WAL is applied.
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * Mask hint bits in PageHeader
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..a4a7f57 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,118 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   	*rawstring;
+	List	   	*elemlist;
+	ListCell   	*l;
+	bool		*newwalconsistency;
+	bool		isRmgrId = false;	/* Does this guc include any
+							* individual resource manager? */
+	bool		isAll = false;	/* Does this guc include 'all' keyword? */
+	int		i;
+
+	newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Initialize the array*/
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char		*tok = (char *) lfirst(l);
+		bool		found = false;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if mask function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * We'll enable this feature only for the resource managers for which
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* guc should contain either 'all' or combination of resource managers. */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	*extra = (void *) newwalconsistency;
+
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ca734fe 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = ''			# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index d070312..f305eab 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -465,7 +465,12 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					   rnode.spcNode, rnode.dbNode, rnode.relNode,
 					   blk);
 			if (XLogRecHasBlockImage(record, block_id))
-				printf(" FPW");
+			{
+				if (XLogRecBlockImageApply(record, block_id))
+					printf(" FPW (apply)");
+				else
+					printf(" FPW");
+			}
 		}
 		putchar('\n');
 	}
@@ -486,21 +491,45 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
-					BKPIMAGE_IS_COMPRESSED)
+				if (XLogRecBlockImageApply(record, block_id))
 				{
-					printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
-						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+					if (record->blocks[block_id].bimg_info &
+						BKPIMAGE_IS_COMPRESSED)
+					{
+						printf(" (FPW) apply; hole: offset: %u, length: %u, "
+							"compression saved: %u\n",
+							   record->blocks[block_id].hole_offset,
+							   record->blocks[block_id].hole_length,
+							   BLCKSZ -
+							   record->blocks[block_id].hole_length -
+							   record->blocks[block_id].bimg_len);
+					}
+					else
+					{
+						printf(" (FPW) apply; hole: offset: %u, length: %u\n",
+							   record->blocks[block_id].hole_offset,
+							   record->blocks[block_id].hole_length);
+					}
 				}
 				else
 				{
-					printf(" (FPW); hole: offset: %u, length: %u\n",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+					if (record->blocks[block_id].bimg_info &
+						BKPIMAGE_IS_COMPRESSED)
+					{
+						printf(" (FPW); hole: offset: %u, length: %u, "
+							"compression saved: %u\n",
+							   record->blocks[block_id].hole_offset,
+							   record->blocks[block_id].hole_length,
+							   BLCKSZ -
+							   record->blocks[block_id].hole_length -
+							   record->blocks[block_id].bimg_len);
+					}
+					else
+					{
+						printf(" (FPW); hole: offset: %u, length: %u\n",
+							   record->blocks[block_id].hole_offset,
+							   record->blocks[block_id].hole_length);
+					}
 				}
 			}
 			putchar('\n');
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..697a4ef 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,7 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		apply_image; /* Restore image during WAL replay */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -205,6 +206,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	((decoder)->blocks[block_id].apply_image)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..9e8ff3f 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary; note that if wal_consistency
+ * is enabled this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..4ba3469
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *	  Definitions for buffer masking routines, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
#93Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#92)
1 attachment(s)
Re: WAL consistency check facility

On Wed, Nov 9, 2016 at 11:32 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

Thanks a lot for reviewing the patch. Based on your review, I've attached the
updated patch along with a few comments.

Thanks for the new version. pg_xlogdump is really helpful now for debugging.

I haven't performed any tests with the patch, and that's all I have
regarding the code. With that done we should be in good shape
code-speaking I think...

I've done a fair amount of testing, which includes the regression tests
and manual creation of inconsistencies in the page. I've also found the
reason behind the inconsistency in the BRIN revmap page.
A BRIN revmap page doesn't have the standard page layout, and its
operations don't update pd_upper and pd_lower either. But during WAL
insertion it uses REGBUF_STANDARD to register a reference in the WAL
record. Hence, when we restore the image before the consistency check,
RestoreBlockImage fills the space between pd_lower and pd_upper (the
page hole) with zeros. I've posted this as a separate thread.
/messages/by-id/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
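
The page-hole effect described above can be reproduced with a toy model. This is only a sketch under stated assumptions: the page size, the two-field header, and every toy_* name are hypothetical and are not PostgreSQL code. It only mimics the idea that an image logged as "standard layout" drops the bytes between the lower and upper offsets, and that restoring it fills that range with zeros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64		/* hypothetical stand-in for BLCKSZ */

/* Toy page header: only the two offsets that delimit the "hole". */
typedef struct ToyHeader
{
	uint16_t	lower;			/* stand-in for pd_lower */
	uint16_t	upper;			/* stand-in for pd_upper */
} ToyHeader;

/* Initialize an all-zero page whose header declares the hole [lower, upper). */
static void
toy_init_page(unsigned char *page, uint16_t lower, uint16_t upper)
{
	ToyHeader	hdr = {lower, upper};

	memset(page, 0, TOY_PAGE_SIZE);
	memcpy(page, &hdr, sizeof(hdr));
}

/*
 * Store a page image without the bytes in [lower, upper), the way an image
 * of a standard-layout page is assumed to be logged. Returns the image size.
 */
static size_t
toy_log_image(const unsigned char *page, unsigned char *image)
{
	ToyHeader	hdr;

	memcpy(&hdr, page, sizeof(hdr));
	assert(sizeof(hdr) <= hdr.lower);
	assert(hdr.lower <= hdr.upper && hdr.upper <= TOY_PAGE_SIZE);

	memcpy(image, page, hdr.lower);
	memcpy(image + hdr.lower, page + hdr.upper, TOY_PAGE_SIZE - hdr.upper);
	return hdr.lower + (TOY_PAGE_SIZE - hdr.upper);
}

/* Rebuild a full page from such an image, zero-filling the hole. */
static void
toy_restore_image(const unsigned char *image, unsigned char *page)
{
	ToyHeader	hdr;

	memcpy(&hdr, image, sizeof(hdr));
	memcpy(page, image, hdr.lower);
	memset(page + hdr.lower, 0, hdr.upper - hdr.lower);
	memcpy(page + hdr.upper, image + hdr.lower, TOY_PAGE_SIZE - hdr.upper);
}

/* Returns 1 if the page survives a log/restore round trip unchanged. */
static int
toy_round_trip_matches(const unsigned char *page)
{
	unsigned char image[TOY_PAGE_SIZE];
	unsigned char restored[TOY_PAGE_SIZE];

	toy_log_image(page, image);
	toy_restore_image(image, restored);
	return memcmp(page, restored, TOY_PAGE_SIZE) == 0;
}
```

A page that keeps its data outside the declared hole round-trips cleanly. A page that stores anything inside the hole, as a non-standard layout may, comes back different, which is the kind of mismatch the consistency check reports.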

Nice to have spotted the inconsistency. This tool is going to be useful.

I have spent a couple of hours playing with the patch, and worked on
it a bit more with a couple of minor changes:
- In gindesc.c, the if blocks are more readable with brackets.
- Addition of a comment when info is set, to mention that this is done
at the beginning of the routine so that callers of XLogInsert() can pass
the flag for consistency checks.
- apply_image should be reset in ResetDecoder().
- The BRIN problem is here:
2016-11-10 12:24:10 JST [65776]: [23-1] db=,user=,app=,client= FATAL:
Inconsistent page found, rel 1663/16385/30625, forknum 0, blkno 1
2016-11-10 12:24:10 JST [65776]: [24-1] db=,user=,app=,client=
CONTEXT: xlog redo at 0/9BD31148 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1
2016-11-10 12:24:10 JST [65776]: [25-1] db=,user=,app=,client=
WARNING: buffer refcount leak: [4540] (rel=base/16385/30625,
blockNum=1, flags=0x93800000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
Now the buffer refcount leak is not normal! The safest thing to do
here is to release the buffer once a copy of it has been taken, and
the leak goes away when calling FATAL to report the inconsistency.
- When checking for XLR_CHECK_CONSISTENCY, better to add a != 0 to get
a boolean out of it.
- guc_malloc() should be done as late as possible. This simplifies the
code and prevents the memory leak that your patch has when there is an
error. (I have finally put my finger on what was itching me here.)
- In btree_mask, the lookup of BTP_DELETED can be greatly simplified.
- I wondered also about putting assign_wal_consistency and
check_wal_consistency in a separate file, say xlogparams.c, concluding
that the current code does nothing bad either even if it adds rmgr.h
in the list of headers in guc.c.
- Some comment blocks are longer than 72~80 characters.
- Typos!
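
The buffer-release point in the list above can be sketched as follows. This is a minimal toy model and not the patch's code: ToyBuffer, its pin counter, and the masked byte positions are all assumptions made for illustration. The idea is to copy the page while the buffer is pinned, drop the pin immediately, and only then mask and compare, so that an error raised on mismatch cannot leave a pin behind.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE	32		/* hypothetical stand-in for BLCKSZ */
#define TOY_MASK_MARKER 0xFF	/* same idea as MASK_MARKER in the patch */

/* Toy shared buffer: page contents plus a pin counter. */
typedef struct ToyBuffer
{
	unsigned char data[TOY_PAGE_SIZE];
	int			pin_count;
} ToyBuffer;

/*
 * Hypothetical layout: bytes 0..7 hold an LSN-like value and byte 8 holds
 * hint flags. Both may legitimately differ, so overwrite them with a marker.
 */
static void
toy_mask(unsigned char *page)
{
	memset(page, TOY_MASK_MARKER, 8);
	page[8] = TOY_MASK_MARKER;
}

/*
 * Compare the current buffer contents with a logged image after masking.
 * Returns 1 if consistent, 0 otherwise. The pin is dropped right after the
 * copy, before anything that could fail.
 */
static int
toy_pages_consistent(ToyBuffer *buf, const unsigned char *logged_image)
{
	unsigned char new_masked[TOY_PAGE_SIZE];
	unsigned char old_masked[TOY_PAGE_SIZE];

	buf->pin_count++;			/* pin the buffer */
	memcpy(new_masked, buf->data, TOY_PAGE_SIZE);
	buf->pin_count--;			/* release it as soon as the copy exists */

	memcpy(old_masked, logged_image, TOY_PAGE_SIZE);

	toy_mask(new_masked);
	toy_mask(old_masked);

	return memcmp(new_masked, old_masked, TOY_PAGE_SIZE) == 0;
}
```

Because the comparison works only on private copies, the caller is free to report the mismatch at any severity without having to clean up the pin first.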

With the patch for BRIN applied, I am able to get installcheck-world
working with wal_consistency = all and a standby doing the consistency
checks behind. Adding wal_consistency = all in PostgresNode.pm, the
recovery tests are passing. This patch is switched as "Ready for
Committer". Thanks for completing this effort begun 3 years ago!
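
For readers who want to see how a value such as "all" or "heap, btree" could map onto per-resource-manager flags, here is a rough standalone sketch. It is not the check hook from the patch: the toy_* names are hypothetical, the name list is copied from the documentation text in the attached patch, and the choice to accept "all" mixed with individual names is an assumption that the real hook may not share.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Resource manager names as listed in the patch's documentation. */
static const char *const toy_rmgr_names[] = {
	"heap2", "heap", "btree", "gin", "gist", "spgist", "sequence", "brin"
};

#define TOY_NUM_RMGRS (sizeof(toy_rmgr_names) / sizeof(toy_rmgr_names[0]))

/*
 * Parse "all" or a comma-separated list of names into enabled[].
 * Returns false on an unknown name and leaves enabled[] untouched.
 * An empty string is valid and enables nothing.
 */
static bool
toy_parse_wal_consistency(const char *value, bool enabled[TOY_NUM_RMGRS])
{
	bool		result[TOY_NUM_RMGRS] = {false};
	const char *p = value;

	while (*p != '\0')
	{
		const char *end;
		size_t		len;
		size_t		i;
		bool		found = false;

		/* Skip separators and spaces before the next token. */
		while (*p == ',' || *p == ' ')
			p++;
		if (*p == '\0')
			break;

		/* Find the end of the token. */
		end = p;
		while (*end != '\0' && *end != ',' && *end != ' ')
			end++;
		len = (size_t) (end - p);

		if (len == 3 && strncmp(p, "all", 3) == 0)
		{
			for (i = 0; i < TOY_NUM_RMGRS; i++)
				result[i] = true;
			found = true;
		}
		else
		{
			for (i = 0; i < TOY_NUM_RMGRS; i++)
			{
				if (strlen(toy_rmgr_names[i]) == len &&
					strncmp(p, toy_rmgr_names[i], len) == 0)
				{
					result[i] = true;
					found = true;
					break;
				}
			}
		}

		if (!found)
			return false;		/* unknown resource manager name */
		p = end;
	}

	/* Publish the result only after the whole string has been validated. */
	memcpy(enabled, result, sizeof(result));
	return true;
}
```

Building the result in a local array and copying it out only on success mirrors the review comment about allocating as late as possible: nothing needs to be undone on the error path.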
--
Michael

Attachments:

walconsistency_v14.patch (text/plain; charset=US-ASCII)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..57660d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..f8604db 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,31 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..f7abb9c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..c5fe761 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,61 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = PG_UINT32_MAX;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..bd1e353 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,56 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+			   sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+			   sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index db832a5..75d0e09 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -113,7 +113,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 					(ginxlogRecompressDataLeaf *) payload;
 
 					if (XLogRecHasBlockImage(record, 0))
-						appendStringInfoString(buf, " (full page image)");
+					{
+						if (XLogRecBlockImageApply(record, 0))
+							appendStringInfoString(buf, " (full page image, apply)");
+						else
+							appendStringInfoString(buf, " (full page image)");
+					}
 					else
 						desc_recompress_leaf(buf, insertData);
 				}
@@ -147,7 +152,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 				ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) rec;
 
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image, apply)");
+					else
+						appendStringInfoString(buf, " (full page image)");
+				}
 				else
 					desc_recompress_leaf(buf, &xlrec->data);
 			}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..f66f73a 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,19 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..a8355659 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1262,6 +1269,99 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This is possible for
+	 * resource managers that generate only full-page writes; comparing
+	 * an image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		Assert(XLogRecHasBlockImage(record, block_id));
+
+		/*
+		 * If we've just restored the block from backup image, skip
+		 * consistency check.
+		 */
+		if (XLogRecBlockImageApply(record, block_id))
+			continue;
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* No need for this page anymore now that a copy is in */
+		ReleaseBuffer(buf);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6149,6 +6249,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks, those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6949,6 +7056,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7479,6 +7595,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..c635844 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -498,6 +500,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	hdr_rdt.data = hdr_scratch;
 
 	/*
+	 * Enforce consistency checks for this record if the user is asking for
+	 * them. Do this at the beginning of this routine to give callers of
+	 * XLogInsert() the possibility to pass XLR_CHECK_CONSISTENCY directly
+	 * for a record.
+	 */
+	if (wal_consistency[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	/*
 	 * Make an rdata chain containing all the data portions of all block
 	 * references. This includes the data for full-page images. Also append
 	 * the headers for the block references in the scratch buffer.
@@ -513,6 +524,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image;
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +568,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current
+		 * block.
+		 */
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +637,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * If WAL consistency is enabled for the resource manager of
+			 * this WAL record, a full-page image is included in the record
+			 * for the block modified. During redo, the full-page is replayed
+			 * only if BKPIMAGE_APPLY is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +708,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 56d4c66..4be6373 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -997,6 +997,7 @@ ResetDecoder(XLogReaderState *state)
 		state->blocks[block_id].in_use = false;
 		state->blocks[block_id].has_image = false;
 		state->blocks[block_id].has_data = false;
+		state->blocks[block_id].apply_image = false;
 	}
 	state->max_block_id = -1;
 }
@@ -1089,6 +1090,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			blk = &state->blocks[block_id];
 			blk->in_use = true;
+			blk->apply_image = false;
 
 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
@@ -1120,6 +1122,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+
+				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1243,6 +1248,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
+
+		Assert(blk->has_image || !blk->apply_image);
+
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..651faf2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,9 +275,9 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
- * to protect ourselves against database pages that were partially or
+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
+ * set, we restore it, even if the page in the database appears newer.  This
+ * is to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
  * be.  This will force us to replay all subsequent modifications of the page
@@ -310,9 +310,11 @@ XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id)
  * XLogReadBufferForRedoExtended
  *		Like XLogReadBufferForRedo, but with extra options.
  *
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.
+ *
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO.
  *
  * (The RBM_ZERO_AND_CLEANUP_LOCK mode is redundant with the get_cleanup_lock
@@ -352,9 +354,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
+		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..0e062ac
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking. Used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * mask_page_lsn
+ *
+ * In consistency checks, the LSN of the two pages compared will likely be
+ * different because of concurrent operations when the WAL is generated
+ * and the state of the page when WAL is applied.
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * mask_page_hint_bits
+ *
+ * Mask hint bits in PageHeader.
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * mask_unused_space
+ *
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3c695c1..915d24c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_string,
+		"",
+		check_wal_consistency, assign_wal_consistency, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,121 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		newwalconsistency[RM_MAX_ID + 1];
+	bool		isRmgrId = false;	/* Does this guc include any
+									 * individual resource manager? */
+	bool		isAll = false;		/* Does this guc include 'all' keyword? */
+
+	/* Initialize the array */
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *tok = (char *) lfirst(l);
+		bool		found = false;
+		int			i;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if mask function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * This feature is enabled only for the resource managers where
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/*
+	 * Parameter should contain either 'all' or a combination of resource
+	 * managers.
+	 */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	/* assign new value */
+	*extra = guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+	memcpy(*extra, newwalconsistency, (RM_MAX_ID + 1) * sizeof(bool));
+	return true;
+}
+
+static void
+assign_wal_consistency(const char *newval, void *extra)
+{
+	wal_consistency = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ca734fe 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency = ''			# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index d070312..48a3d48 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -465,7 +465,12 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					   rnode.spcNode, rnode.dbNode, rnode.relNode,
 					   blk);
 			if (XLogRecHasBlockImage(record, block_id))
-				printf(" FPW");
+			{
+				if (XLogRecBlockImageApply(record, block_id))
+					printf(" FPW (apply)");
+				else
+					printf(" FPW");
+			}
 		}
 		putchar('\n');
 	}
@@ -489,7 +494,10 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				if (record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
-					printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u, "
+						"compression saved: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length,
 						   BLCKSZ -
@@ -498,7 +506,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				}
 				else
 				{
-					printf(" (FPW); hole: offset: %u, length: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length);
 				}
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's
+ * records and performs masking actions on it for consistency check
+ * comparisons. The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..697a4ef 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,7 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		apply_image; /* Restore image during WAL replay */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -205,6 +206,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	((decoder)->blocks[block_id].apply_image)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..972d99d 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary; note that if wal_consistency
+ * is enabled, this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,8 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored
+										 * during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..ab1a93c
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *	  Definitions for buffer masking routines, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
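
As a rough illustration of how the masking routines in bufmask.c are meant to be combined with a page comparison, here is a minimal standalone sketch. It does not use PostgreSQL headers: ToyHeader, toy_mask, toy_pages_consistent and toy_init are hypothetical names invented for this example, and the 16-byte header is an assumption, not the real PageHeaderData layout. Only the MASK_MARKER value (0xFF) is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ	8192	/* stand-in for BLCKSZ (assumption) */
#define MASK_MARKER	0xFF	/* same marker value as in bufmask.h */

/* Hypothetical, simplified header; NOT PostgreSQL's PageHeaderData. */
typedef struct ToyHeader
{
	uint64_t	lsn;		/* may differ between the two page copies */
	uint16_t	flags;		/* hint-style flags */
	uint16_t	lower;		/* offset to start of free space */
	uint16_t	upper;		/* offset to end of free space */
	uint16_t	reserved;	/* keeps the struct free of padding bytes */
} ToyHeader;

/* Overwrite everything that may legitimately differ with a fixed pattern. */
static void
toy_mask(unsigned char *page)
{
	ToyHeader	hdr;

	memcpy(&hdr, page, sizeof(hdr));

	/* Sanity check, in the spirit of mask_unused_space() */
	assert(hdr.lower >= sizeof(hdr) && hdr.lower <= hdr.upper &&
		   hdr.upper <= TOY_BLCKSZ);

	hdr.lsn = UINT64_MAX;
	hdr.flags = UINT16_MAX;
	memcpy(page, &hdr, sizeof(hdr));

	/* Mask the free space between lower and upper */
	memset(page + hdr.lower, MASK_MARKER, hdr.upper - hdr.lower);
}

/* Compare masked copies, so that the caller's pages are left untouched. */
static int
toy_pages_consistent(const unsigned char *replayed, const unsigned char *image)
{
	static unsigned char a[TOY_BLCKSZ];
	static unsigned char b[TOY_BLCKSZ];

	memcpy(a, replayed, TOY_BLCKSZ);
	memcpy(b, image, TOY_BLCKSZ);
	toy_mask(a);
	toy_mask(b);
	return memcmp(a, b, TOY_BLCKSZ) == 0;
}

/* Build a toy page: header, junk in the free space, one data byte at the end. */
static void
toy_init(unsigned char *page, uint64_t lsn, uint16_t flags,
		 unsigned char junk, unsigned char data)
{
	ToyHeader	hdr;

	hdr.lsn = lsn;
	hdr.flags = flags;
	hdr.lower = (uint16_t) sizeof(ToyHeader);
	hdr.upper = (uint16_t) (TOY_BLCKSZ - 1);
	hdr.reserved = 0;

	memset(page, junk, TOY_BLCKSZ);
	memcpy(page, &hdr, sizeof(hdr));
	page[TOY_BLCKSZ - 1] = data;
}
```

Two pages built with toy_init() that differ only in LSN, flags and free-space contents compare as consistent, while a difference in the trailing data byte is reported as an inconsistency.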
#94Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#93)
Re: WAL consistency check facility

On Thu, Nov 10, 2016 at 10:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 9, 2016 at 11:32 PM, Kuntal Ghosh

Thanks a lot for reviewing the patch. Based on your review, I've attached the

I've done a fair amount of testing, which includes regression tests
and manual creation of inconsistencies in the page. I've also found the
reason behind the inconsistency in the BRIN revmap page.
The BRIN revmap page doesn't have the standard page layout. Besides, it
doesn't update pd_upper and pd_lower in its operations either. But during
WAL insertion, it uses REGBUF_STANDARD to register a reference in the WAL
record. Hence, when we restore the image before the consistency check,
RestoreBlockImage fills the space between pd_lower and pd_upper (the page
hole) with zeroes. I've posted this as a separate thread.
/messages/by-id/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
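
The effect described above can be reproduced in a small standalone sketch, assuming a simplified model in which a standard-layout image is logged without its hole and restored with the hole zero-filled. The toy_* names and TOY_BLCKSZ are hypothetical, and this is not the actual RestoreBlockImage code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192		/* stand-in for BLCKSZ (assumption) */

/* "Log" a page as a standard-layout image: bytes in [lower, upper) are dropped. */
static size_t
toy_compress_hole(const unsigned char *page, uint16_t lower, uint16_t upper,
				  unsigned char *image)
{
	assert(lower <= upper && upper <= TOY_BLCKSZ);
	memcpy(image, page, lower);
	memcpy(image + lower, page + upper, TOY_BLCKSZ - upper);
	return (size_t) lower + (TOY_BLCKSZ - upper);
}

/* "Restore" the image: the hole is re-created, filled with zeroes. */
static void
toy_restore_hole(const unsigned char *image, uint16_t lower, uint16_t upper,
				 unsigned char *page)
{
	memcpy(page, image, lower);
	memset(page + lower, 0, upper - lower);
	memcpy(page + upper, image + lower, TOY_BLCKSZ - upper);
}

/* Returns 1 if the page survives the log/restore round trip unchanged. */
static int
toy_round_trip_matches(const unsigned char *page, uint16_t lower, uint16_t upper)
{
	static unsigned char image[TOY_BLCKSZ];
	static unsigned char restored[TOY_BLCKSZ];

	toy_compress_hole(page, lower, upper, image);
	toy_restore_hole(image, lower, upper, restored);
	return memcmp(page, restored, TOY_BLCKSZ) == 0;
}
```

A page whose bytes between lower and upper are all zero survives the round trip. A page that keeps real data in that range while its header offsets still describe it as free space comes back different, which is the kind of mismatch the consistency check reports.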

Nice to have spotted the inconsistency. This tool is going to be useful.

I have spent a couple of hours playing with the patch, and worked on
it a bit more with a couple of minor changes:
- In gindesc.c, the if blocks are more readable with brackets.
- Addition of a comment when info is set, to mention that this is done
at the beginning of the routine so that callers of XLogInsert() can pass
the flag for consistency checks.
- apply_image should be reset in ResetDecoder().
- The BRIN problem is here:
2016-11-10 12:24:10 JST [65776]: [23-1] db=,user=,app=,client= FATAL:
Inconsistent page found, rel 1663/16385/30625, forknum 0, blkno 1
2016-11-10 12:24:10 JST [65776]: [24-1] db=,user=,app=,client=
CONTEXT: xlog redo at 0/9BD31148 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1
2016-11-10 12:24:10 JST [65776]: [25-1] db=,user=,app=,client=
WARNING: buffer refcount leak: [4540] (rel=base/16385/30625,
blockNum=1, flags=0x93800000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
Now the buffer refcount leak is not normal! The safest thing to do
here is to release the buffer once a copy of it has been taken, and
the leak goes away when calling FATAL to report the inconsistency.
- When checking for XLR_CHECK_CONSISTENCY, better to add a != 0 to get
a boolean out of it.
- guc_malloc() should be done as late as possible; this simplifies the
code and prevents memory leaks, which your patch causes when there is
an error. (I have finally put my finger on what was itching me here).
- In btree_mask, the lookup of BTP_DELETED can be greatly simplified.
- I wondered also about putting assign_wal_consistency and
check_wal_consistency in a separate file, say xlogparams.c, concluding
that the current code does nothing bad either, even if it adds rmgr.h
to the list of headers in guc.c.
- Some comment blocks are longer than 72~80 characters.
- Typos!
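
To illustrate the point above about adding != 0 when testing XLR_CHECK_CONSISTENCY, here is a minimal sketch. The flag value 0x02 comes from the patch; toy_bool, flag_raw and flag_normalized are hypothetical names, and the char-sized boolean type is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define XLR_CHECK_CONSISTENCY	0x02	/* value taken from the patch */

typedef char toy_bool;		/* assumption: a char-sized boolean type */

/* Stores the raw masked bits: yields 2 when the flag is set, not 1. */
static toy_bool
flag_raw(uint8_t info)
{
	return (toy_bool) (info & XLR_CHECK_CONSISTENCY);
}

/* Normalizes the masked bits to exactly 0 or 1. */
static toy_bool
flag_normalized(uint8_t info)
{
	return (info & XLR_CHECK_CONSISTENCY) != 0;
}
```

With the raw form, a later comparison such as flag == 1 would fail even though the bit is set; the normalized form avoids that.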

All the changes make perfect sense to me.

With the patch for BRIN applied, I am able to get installcheck-world
working with wal_consistency = all and a standby doing the consistency
checks behind. Adding wal_consistency = all in PostgresNode.pm, the
recovery tests are passing. This patch is switched as "Ready for
Committer". Thanks for completing this effort begun 3 years ago!

Thanks to you for reviewing all the patches in so much detail. Amit, Robert
and Dilip also helped me a lot in developing the feature. Thanks to them
as well.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#95Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#94)
Re: WAL consistency check facility

On Thu, Nov 10, 2016 at 10:02 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

With the patch for BRIN applied, I am able to get installcheck-world
working with wal_consistency = all and a standby doing the consistency
checks behind. Adding wal_consistency = all in PostgresNode.pm, the
recovery tests are passing. This patch is switched as "Ready for
Committer". Thanks for completing this effort begun 3 years ago!

Thanks to you for reviewing all the patches in so much detail. Amit, Robert
and Dilip also helped me a lot in developing the feature. Thanks to them
as well.

So, who should be credited as co-authors of this patch and in what
order, if and when it gets committed? If X started this patch and
then Kuntal did a little more work on it, I would credit it as:

X and Kuntal Ghosh

If Kuntal did major work on it, though, then I would think of
something more like:

Kuntal Ghosh, based on an earlier patch from X

If he didn't use any of the old code but just the idea, then I would
do something like this:

Kuntal Ghosh, inspired by a previous patch from X

So, who are all of the people involved in the effort to produce this
patch, and what's the right way to attribute credit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#96Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#95)
Re: WAL consistency check facility

On Fri, Nov 11, 2016 at 1:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So, who are all of the people involved in the effort to produce this
patch, and what's the right way to attribute credit?

The original idea was from Heikki, as he introduced the idea of
doing such checks if you look at the original thread. I added on top
of it a couple of things like the concept of page masking, and hacked
together an early version of what we have now
(/messages/by-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com).
So it seems to me that marking Heikki as an author would be fair, as
the original idea is from him. I don't mind being marked only as a
reviewer of the feature, or as someone who has written an
earlier version of the patch, but I leave that up to your judgement.
Kuntal is definitely an author.
--
Michael


#97Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#96)
Re: WAL consistency check facility

On Fri, Nov 11, 2016 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Nov 11, 2016 at 1:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So, who are all of the people involved in the effort to produce this
patch, and what's the right way to attribute credit?

The original idea was from Heikki, as he introduced the idea of
doing such checks if you look at the original thread. I added on top
of it a couple of things like the concept of page masking, and hacked
together an early version of what we have now
(/messages/by-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com).
So it seems to me that marking Heikki as an author would be fair, as
the original idea is from him. I don't mind being marked only as a
reviewer of the feature, or as someone who has written an
earlier version of the patch, but I leave that up to your judgement.
Kuntal is definitely an author.

Although a lot of changes have been made since, I started with the patch
attached in the above thread. Hence, I feel the author of that patch should
also get the credit.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#98Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#93)
Re: WAL consistency check facility

On 11/9/16 11:55 PM, Michael Paquier wrote:

+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>

Could we name this something like wal_consistency_checking?

Otherwise it sounds like you use this to select the amount of
consistency in the WAL (similar to, say, wal_level).

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#99Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#98)
Re: WAL consistency check facility

On Sun, Nov 13, 2016 at 12:06 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Could we name this something like wal_consistency_checking?

Otherwise it sounds like you use this to select the amount of
consistency in the WAL (similar to, say, wal_level).

Or wal_check? Or wal_consistency_check?
--
Michael


#100Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#98)
Re: WAL consistency check facility

On Sat, Nov 12, 2016 at 10:06 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 11/9/16 11:55 PM, Michael Paquier wrote:

+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>

Could we name this something like wal_consistency_checking?

Otherwise it sounds like you use this to select the amount of
consistency in the WAL (similar to, say, wal_level).

+1. I like that name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#101Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#100)
1 attachment(s)
Re: WAL consistency check facility

On Tue, Nov 15, 2016 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Nov 12, 2016 at 10:06 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 11/9/16 11:55 PM, Michael Paquier wrote:

+     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
+      <term><varname>wal_consistency</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>

Could we name this something like wal_consistency_checking?

Otherwise it sounds like you use this to select the amount of
consistency in the WAL (similar to, say, wal_level).

+1. I like that name.

I've renamed the GUC parameter to wal_consistency_check (I'm a little
hesitant about using a participle as the suffix :) ), and updated the SGML
and variable names accordingly.
FYI, the regression tests will fail because of an inconsistency in a BRIN
page. I've already submitted a patch for that; the thread is here:
/messages/by-id/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
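
For anyone skimming the attached patch, here is a minimal standalone sketch of
the mask-then-compare idea behind checkConsistency(): copy both pages, mask the
regions that may legitimately differ (the LSN and the unused space between
lower and upper), then memcmp() the results. It assumes a toy page layout; the
names toy_page_init, toy_mask, toy_pages_consistent, TOY_BLCKSZ and
TOY_MASK_MARKER are hypothetical and are not part of PostgreSQL or of the
patch, which uses the per-rmgr rm_mask callbacks on real pages instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ      64      /* hypothetical page size, not BLCKSZ */
#define TOY_MASK_MARKER 0xFF    /* hypothetical stand-in for MASK_MARKER */

/* Hypothetical header layout: 8-byte LSN, then 2-byte lower and upper. */
#define TOY_OFF_LSN     0
#define TOY_OFF_LOWER   8
#define TOY_OFF_UPPER   10
#define TOY_HDR_SIZE    12

/* Zero a page and write its header fields. */
static void
toy_page_init(unsigned char *page, uint64_t lsn, uint16_t lower, uint16_t upper)
{
	memset(page, 0, TOY_BLCKSZ);
	memcpy(page + TOY_OFF_LSN, &lsn, sizeof(lsn));
	memcpy(page + TOY_OFF_LOWER, &lower, sizeof(lower));
	memcpy(page + TOY_OFF_UPPER, &upper, sizeof(upper));
}

/* Overwrite the parts of a page that may legitimately differ. */
static void
toy_mask(unsigned char *page)
{
	uint64_t	masked_lsn = UINT64_MAX;
	uint16_t	lower;
	uint16_t	upper;

	memcpy(&lower, page + TOY_OFF_LOWER, sizeof(lower));
	memcpy(&upper, page + TOY_OFF_UPPER, sizeof(upper));

	/* Sanity check, in the spirit of mask_unused_space() */
	assert(lower >= TOY_HDR_SIZE && lower <= upper && upper <= TOY_BLCKSZ);

	/* The LSN is expected to differ between the image and the replayed page */
	memcpy(page + TOY_OFF_LSN, &masked_lsn, sizeof(masked_lsn));

	/* The free space between lower and upper may hold arbitrary bytes */
	memset(page + lower, TOY_MASK_MARKER, upper - lower);
}

/* Return 1 if the two pages match after masking, else 0. */
static int
toy_pages_consistent(const unsigned char *replayed, const unsigned char *image)
{
	unsigned char a[TOY_BLCKSZ];
	unsigned char b[TOY_BLCKSZ];

	/* Work on copies so the caller's pages are left untouched */
	memcpy(a, replayed, TOY_BLCKSZ);
	memcpy(b, image, TOY_BLCKSZ);
	toy_mask(a);
	toy_mask(b);
	return memcmp(a, b, TOY_BLCKSZ) == 0;
}
```

Two pages built with toy_page_init() that differ only in their LSN and in
bytes inside the free space compare as consistent. A differing byte outside
the masked regions makes the comparison fail, which is the case the patch
reports with elog(FATAL).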
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v15.patch (application/x-download)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..57660d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency-check" xreflabel="wal_consistency_check">
+      <term><varname>wal_consistency_check</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency_check</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency_check</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..2af524d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,38 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		mask_unused_space(page_norm);
+
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * If necessary, handle the case of meta and revmap pages here.
+	 */
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..f8604db 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,31 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (opaque->flags != GIN_META)
+	{
+		mask_page_hint_bits(page_norm);
+
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..f7abb9c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,52 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/* Mask NSN */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* In GiST redo, we never mark a page as garbage. Hence, mask it. */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..c5fe761 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,61 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_XACT_MASK;
+			page_htup->t_choice.t_heap.t_field3.t_cid = PG_UINT32_MAX;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..bd1e353 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,56 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+			   sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+			   sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page_norm, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index db832a5..75d0e09 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -113,7 +113,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 					(ginxlogRecompressDataLeaf *) payload;
 
 					if (XLogRecHasBlockImage(record, 0))
-						appendStringInfoString(buf, " (full page image)");
+					{
+						if (XLogRecBlockImageApply(record, 0))
+							appendStringInfoString(buf, " (full page image, apply)");
+						else
+							appendStringInfoString(buf, " (full page image)");
+					}
 					else
 						desc_recompress_leaf(buf, insertData);
 				}
@@ -147,7 +152,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 				ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) rec;
 
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image, apply)");
+					else
+						appendStringInfoString(buf, " (full page image)");
+				}
 				else
 					desc_recompress_leaf(buf, &xlrec->data);
 			}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..f66f73a 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,19 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..a8355659 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_check_string = NULL;
+bool	   *wal_consistency_check = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -867,6 +873,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1262,6 +1269,99 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This is possible for
+	 * resource managers that generate only full-page writes; comparing an
+	 * image to itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Do nothing. */
+			continue;
+		}
+
+		Assert(XLogRecHasBlockImage(record, block_id));
+
+		/*
+		 * If we've just restored the block from backup image, skip
+		 * consistency check.
+		 */
+		if (XLogRecBlockImageApply(record, block_id))
+			continue;
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* No need for this page anymore now that a copy is in */
+		ReleaseBuffer(buf);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6149,6 +6249,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks, those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6949,6 +7056,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -7479,6 +7595,12 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/* Clean up buffers dedicated to WAL consistency checks */
+	if (old_page_masked)
+		pfree(old_page_masked);
+	if (new_page_masked)
+		pfree(new_page_masked);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..c635844 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -414,10 +414,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -498,6 +500,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	hdr_rdt.data = hdr_scratch;
 
 	/*
+	 * Enforce consistency checks for this record if the user is looking for
+	 * it. Do this at the beginning of this routine to give callers of
+	 * XLogInsert() the possibility to pass XLR_CHECK_CONSISTENCY directly
+	 * for a record.
+	 */
+	if (wal_consistency_check[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	/*
 	 * Make an rdata chain containing all the data portions of all block
 	 * references. This includes the data for full-page images. Also append
 	 * the headers for the block references in the scratch buffer.
@@ -513,6 +524,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image;
 
 		if (!regbuf->in_use)
 			continue;
@@ -556,7 +568,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current
+		 * block.
+		 */
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -618,6 +637,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * If WAL consistency is enabled for the resource manager of
+			 * this WAL record, a full-page image is included in the record
+			 * for the block modified. During redo, the full-page is replayed
+			 * only if BKPIMAGE_APPLY is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -680,7 +708,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 56d4c66..4be6373 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -997,6 +997,7 @@ ResetDecoder(XLogReaderState *state)
 		state->blocks[block_id].in_use = false;
 		state->blocks[block_id].has_image = false;
 		state->blocks[block_id].has_data = false;
+		state->blocks[block_id].apply_image = false;
 	}
 	state->max_block_id = -1;
 }
@@ -1089,6 +1090,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			blk = &state->blocks[block_id];
 			blk->in_use = true;
+			blk->apply_image = false;
 
 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
@@ -1120,6 +1122,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+
+				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1243,6 +1248,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
+
+		Assert(blk->has_image || !blk->apply_image);
+
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..651faf2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,9 +275,9 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
- * to protect ourselves against database pages that were partially or
+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
+ * set, we restore it, even if the page in the database appears newer.  This
+ * is to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
  * be.  This will force us to replay all subsequent modifications of the page
@@ -310,9 +310,11 @@ XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id)
  * XLogReadBufferForRedoExtended
  *		Like XLogReadBufferForRedo, but with extra options.
  *
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.
+ *
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO.
  *
  * (The RBM_ZERO_AND_CLEANUP_LOCK mode is redundant with the get_cleanup_lock
@@ -352,9 +354,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
+		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc3a8ee..864d6a9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1646,3 +1647,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..0e062ac
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking. Used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * mask_page_lsn
+ *
+ * In consistency checks, the LSN of the two pages compared will likely be
+ * different because of concurrent operations when the WAL is generated
+ * and the state of the page when WAL is applied.
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * mask_page_hint_bits
+ *
+ * Mask hint bits in PageHeader.
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set on the
+	 * master, so we must silence the failures that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * mask_unused_space
+ *
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %u pd_upper %u pd_special %u",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3c695c1..915d24c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency_check(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency_check(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3254,6 +3259,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency_check", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_check_string,
+		"",
+		check_wal_consistency_check, assign_wal_consistency_check, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9867,6 +9882,121 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency_check(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		newwalconsistency[RM_MAX_ID + 1];
+	bool		isRmgrId = false;	/* Does this guc include any
+									 * individual resource manager? */
+	bool		isAll = false;		/* Does this guc include 'all' keyword? */
+
+	/* Initialize the array */
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *tok = (char *) lfirst(l);
+		bool		found = false;
+		int			i;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if mask function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * This feature is enabled only for the resource managers where
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/*
+	 * Parameter should contain either 'all' or a combination of resource
+	 * managers.
+	 */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("Invalid value combination");
+		return false;
+	}
+
+	/* assign new value */
+	*extra = guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+	memcpy(*extra, newwalconsistency, (RM_MAX_ID + 1) * sizeof(bool));
+	return true;
+}
+
+static void
+assign_wal_consistency_check(const char *newval, void *extra)
+{
+	wal_consistency_check = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..ca734fe 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency_check = ''		# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index d070312..48a3d48 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -465,7 +465,12 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					   rnode.spcNode, rnode.dbNode, rnode.relNode,
 					   blk);
 			if (XLogRecHasBlockImage(record, block_id))
-				printf(" FPW");
+			{
+				if (XLogRecBlockImageApply(record, block_id))
+					printf(" FPW (apply)");
+				else
+					printf(" FPW");
+			}
 		}
 		putchar('\n');
 	}
@@ -489,7 +494,10 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				if (record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
-					printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u, "
+						"compression saved: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length,
 						   BLCKSZ -
@@ -498,7 +506,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				}
 				else
 				{
-					printf(" (FPW); hole: offset: %u, length: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length);
 				}
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..295bf09 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency_check;
+extern char *wal_consistency_check_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..57756b8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..697a4ef 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,7 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		apply_image; /* Restore image during WAL replay */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -205,6 +206,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	((decoder)->blocks[block_id].apply_image)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..972d99d 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary. Note that if
+ * wal_consistency_check is enabled for a rmgr, this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,8 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored
+										 * during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 392a626..6fd4130 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -82,5 +82,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..ab1a93c
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *	  Definitions for buffer masking routines, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
#102Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#101)
Re: WAL consistency check facility

On Tue, Nov 15, 2016 at 7:50 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've renamed the GUC parameter to wal_consistency_check (I'm a little
hesitant about a participle as the suffix :) ). I've also updated the
SGML and variable names accordingly.

The changes look good to me.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#102)
Re: WAL consistency check facility

On Wed, Nov 16, 2016 at 2:15 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 15, 2016 at 7:50 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've renamed the GUC parameter to wal_consistency_check (I'm a little
hesitant about a participle as the suffix :) ). I've also updated the
SGML and variable names accordingly.

The changes look good to me.

Moved to CF 2017-01, as no committers have shown up yet :(
--
Michael

#104Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#103)
Re: WAL consistency check facility

On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Moved to CF 2017-01, as no committers have shown up yet :(

Seeing no other volunteers, here I am.

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me. One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code. Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option. I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem. A bulk load into an existing table might be a good
test case.

Aside from that, I think the biggest issue here is that the masking
functions are virtually free of comments, whereas I think they should
have extensive and detailed comments. For example, in heap_mask, you
have this:

+            page_htup->t_infomask =
+                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

For something like this, you could write "We want to ignore
differences in hint bits, since they can be set by SetHintBits without
emitting WAL. Force them all to be set so that we don't notice
discrepancies." Actually, though, I think that you could be a bit
more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
HEAP_XMIN_FROZEN, so maybe what you should do is clear
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
one is set but not both.

Anyway, leaving that aside, I think every single change that gets
masked in every single masking routine needs a similar comment,
explaining why that change can happen on the master without also
happening on the standby and hopefully referring to the code that
makes that unlogged change.
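
To make the mask-then-compare idea concrete, here is a minimal,
self-contained sketch. It is not code from the patch: the page layout,
the TOY_* constants, and the toy_* functions are hypothetical stand-ins
for PostgreSQL's page header, MASK_MARKER, and the rm_mask callbacks.
It only illustrates the principle that fields which may legitimately
differ (here, the LSN and the free space between "lower" and "upper")
are overwritten with a fixed marker in private copies of both pages
before a byte-wise comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ	64		/* hypothetical tiny page size */
#define TOY_HDRSZ	12		/* 8-byte LSN + 2-byte lower + 2-byte upper */
#define TOY_MARKER	0xFF	/* plays the role of MASK_MARKER */

/* Build a toy page: header, zeroed contents, and "junk" in the free space. */
static void
toy_init_page(unsigned char *page, uint64_t lsn,
			  uint16_t lower, uint16_t upper, unsigned char junk)
{
	memset(page, 0, TOY_BLCKSZ);
	memcpy(page, &lsn, sizeof(lsn));
	memcpy(page + 8, &lower, sizeof(lower));
	memcpy(page + 10, &upper, sizeof(upper));
	memset(page + lower, junk, upper - lower);
}

/* Overwrite the parts of a page that may differ legitimately. */
static bool
toy_mask(unsigned char *page)
{
	uint16_t	lower;
	uint16_t	upper;

	memcpy(&lower, page + 8, sizeof(lower));
	memcpy(&upper, page + 10, sizeof(upper));

	/* Sanity check, in the spirit of mask_unused_space(). */
	if (lower < TOY_HDRSZ || lower > upper || upper > TOY_BLCKSZ)
		return false;

	memset(page, TOY_MARKER, 8);						/* LSN */
	memset(page + lower, TOY_MARKER, upper - lower);	/* free space */
	return true;
}

/* Compare two pages after masking private copies of both. */
static bool
toy_pages_consistent(const unsigned char *a, const unsigned char *b)
{
	unsigned char ma[TOY_BLCKSZ];
	unsigned char mb[TOY_BLCKSZ];

	memcpy(ma, a, TOY_BLCKSZ);
	memcpy(mb, b, TOY_BLCKSZ);
	if (!toy_mask(ma) || !toy_mask(mb))
		return false;
	return memcmp(ma, mb, TOY_BLCKSZ) == 0;
}
```

With this sketch, two pages that differ only in LSN and free-space
contents compare as consistent, while a difference in the tuple area
is still reported. Each masked field is exactly the kind of place
where a comment explaining the unlogged change would belong.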

I think wal_consistency_checking, as proposed by Peter, is better than
wal_consistency_check, as implemented.

Having StartupXLOG() call pfree() on the masking buffers is a waste of
code. The process is going to exit anyway.

+ "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",

Primary error messages aren't capitalized.

+        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+        {
+            /* Caller specified a bogus block_id. Do nothing. */
+            continue;
+        }

Why would the caller do something so dastardly?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#105Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#104)
Re: WAL consistency check facility

On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Moved to CF 2017-01, as no committers have shown up yet :(

Seeing no other volunteers, here I am.

Thanks Robert for looking into the patch.

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me. One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code. Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option. I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem. A bulk load into an existing table might be a good
test case.

I'll do this test and post the results.

+        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+        {
+            /* Caller specified a bogus block_id. Do nothing. */
+            continue;
+        }

Why would the caller do something so dastardly?

Sorry, my mistake. I copied the code from somewhere else but forgot
to update the comment. It should be something like
/* block_id is not used. */

I'll modify the above along with other suggested changes.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#106Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#104)
1 attachment(s)
Re: WAL consistency check facility

On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me. One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code. Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option. I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem. A bulk load into an existing table might be a good
test case.

I've done some bulk-load testing with 16, 32, and 64 clients. I didn't
notice any regression in the results.

Aside from that, I think the biggest issue here is that the masking
functions are virtually free of comments, whereas I think they should
have extensive and detailed comments. For example, in heap_mask, you
have this:

+            page_htup->t_infomask =
+                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

For something like this, you could write "We want to ignore
differences in hint bits, since they can be set by SetHintBits without
emitting WAL. Force them all to be set so that we don't notice
discrepancies." Actually, though, I think that you could be a bit
more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
HEAP_XMIN_FROZEN, so maybe what you should do is clear
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
one is set but not both.

I've modified it as follows:
+
+                       if (!HeapTupleHeaderXminFrozen(page_htup))
+                               page_htup->t_infomask |= HEAP_XACT_MASK;
+                       else
+                               page_htup->t_infomask |=
+                                       HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
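
The effect of this masking on t_infomask can be sketched in isolation.
The following is an illustration only, not the patch itself: the flag
values are copied here so the example compiles on its own (the
authoritative definitions live in access/htup_details.h), and
mask_infomask() is a hypothetical helper that applies the same two
branches to a bare 16-bit value instead of a tuple header.

```c
#include <assert.h>
#include <stdint.h>

/* Copied for illustration; see access/htup_details.h for the real ones. */
#define HEAP_XMIN_COMMITTED		0x0100
#define HEAP_XMIN_INVALID		0x0200
#define HEAP_XMIN_FROZEN		(HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED		0x0400
#define HEAP_XMAX_INVALID		0x0800
#define HEAP_XACT_MASK			0xFFF0

/* Hypothetical helper: apply the masking above to a bare infomask value. */
static uint16_t
mask_infomask(uint16_t infomask)
{
	if ((infomask & HEAP_XMIN_FROZEN) != HEAP_XMIN_FROZEN)
	{
		/* Not frozen: force every bit covered by HEAP_XACT_MASK to 1. */
		infomask |= HEAP_XACT_MASK;
	}
	else
	{
		/* Frozen: keep the xmin bits as they are, mask only the xmax hints. */
		infomask |= HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
	}
	return infomask;
}
```

Under these assumptions, a tuple whose xmin hint bit was set on one
side but not the other masks to the same value on both, and so does a
frozen tuple that differs only in an xmax hint bit. Bits outside
HEAP_XACT_MASK are left alone, so differences there are still
detected. Note that HEAP_XACT_MASK covers more than the four hint
bits, so the non-frozen branch also hides differences in the other
bits that fall under that mask.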

Anyway, leaving that aside, I think every single change that gets
masked in every single masking routine needs a similar comment,
explaining why that change can happen on the master without also
happening on the standby and hopefully referring to the code that
makes that unlogged change.

I've added comments for all the masking routines.

I think wal_consistency_checking, as proposed by Peter, is better than
wal_consistency_check, as implemented.

Modified to wal_consistency_checking.

Having StartupXLOG() call pfree() on the masking buffers is a waste of
code. The process is going to exit anyway.

+ "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",

Primary error messages aren't capitalized.

Done.

+        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+        {
+            /* Caller specified a bogus block_id. Do nothing. */
+            continue;
+        }

Why would the caller do something so dastardly?

Modified to the following comment:
+                       /*
+                        * WAL record doesn't contain a block reference
+                        * with the given id. Do nothing.
+                        */
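
As a rough illustration of why the loop has to skip some block ids,
here is a small stand-alone sketch. It assumes that a record's block
references need not be contiguous, so an id at or below the maximum
may simply be unused. ToyRecord, ToyBlockRef, toy_get_block_tag(), and
toy_count_checked_blocks() are hypothetical names that only mimic the
shape of the decoded record and of XLogRecGetBlockTag(); they are not
the real structures.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_BLOCK_ID	4	/* hypothetical upper bound on block ids */

typedef struct ToyBlockRef
{
	bool		in_use;			/* does the record reference this id? */
	unsigned	blkno;			/* block number, valid only if in_use */
} ToyBlockRef;

typedef struct ToyRecord
{
	int			max_block_id;	/* highest id used by this record */
	ToyBlockRef blocks[TOY_MAX_BLOCK_ID + 1];
} ToyRecord;

/* Return false when the record has no block reference with this id. */
static bool
toy_get_block_tag(const ToyRecord *rec, int block_id, unsigned *blkno)
{
	if (block_id < 0 || block_id > rec->max_block_id ||
		!rec->blocks[block_id].in_use)
		return false;
	*blkno = rec->blocks[block_id].blkno;
	return true;
}

/* Visit every id up to max_block_id, skipping the ones not in use. */
static int
toy_count_checked_blocks(const ToyRecord *rec)
{
	int			checked = 0;
	int			block_id;

	for (block_id = 0; block_id <= rec->max_block_id; block_id++)
	{
		unsigned	blkno;

		if (!toy_get_block_tag(rec, block_id, &blkno))
			continue;			/* no block reference with this id */
		checked++;				/* a real check would mask and compare here */
	}
	return checked;
}
```

In this sketch a record that uses ids 0 and 2 but not 1 yields two
checked blocks, and the skipped id is an expected case rather than a
caller error.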

I've attached the updated patch with these changes.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v16.patch (application/x-download)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8d7b3bf..fa47054 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2501,6 +2501,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency-checking" xreflabel="wal_consistency_checking">
+      <term><varname>wal_consistency_checking</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency_checking</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency_checking</varname> is enabled for a WAL record, a
+        full-page image is stored along with the record. When such an image is
+        encountered during redo, it is compared against the current page to check
+        whether both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 5a6b728..d634875 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,39 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	if (BRIN_IS_REGULAR_PAGE(page_norm))
+	{
+		/* Regular brin pages contain unused space which needs to be masked. */
+		mask_unused_space(page_norm);
+
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..335d62b 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,34 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(page_norm);
+	opaque = GinPageGetOpaque(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	/*
+	 * GIN metapage doesn't use pd_lower/pd_upper. Other page types do. Hence,
+	 * we need to apply masking for those pages.
+	 */
+	if (opaque->flags != GIN_META)
+	{
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty.
+		 * Hence mask everything.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			memset(page_norm, MASK_MARKER, BLCKSZ);
+		else
+			mask_unused_space(page_norm);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 5853d76..9adca76 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,58 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	/*
+	 * NSN is nothing but a special purpose LSN. Hence mask it for
+	 * the same reason as mask_page_lsn.
+	 */
+	GistPageSetNSN(page_norm, PG_UINT64_MAX);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record.  Hence, mask this flag.
+	 */
+	GistMarkFollowRight(page_norm);
+
+	if (GistPageIsLeaf(page_norm))
+	{
+		/*
+		 * For gist leaf pages, mask some line pointer bits, particularly
+		 * those marked as used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * In Gist redo, we never mark a page as garbage. Hence, mask it to ignore any
+	 * difference.
+	 */
+	GistClearPageHasGarbage(page_norm);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ea579a0..0892346 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9131,3 +9132,65 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber off;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++)
+	{
+		ItemId	iid = PageGetItemId(page_norm, off);
+		char   *page_item;
+
+		page_item = (char *) (page_norm + ItemIdGetOffset(iid));
+
+		if (ItemIdIsNormal(iid))
+		{
+
+			/*
+			 * We want to ignore differences in hint bits, since they can
+			 * be set without emitting WAL.
+			 */
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			if (!HeapTupleHeaderXminFrozen(page_htup))
+				page_htup->t_infomask |= HEAP_XACT_MASK;
+			else
+				page_htup->t_infomask |= HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+
+			/* During replay, we set Command Id to FirstCommandId. Hence, mask it. */
+			page_htup->t_choice.t_heap.t_field3.t_cid = PG_UINT32_MAX;
+
+			/*
+			 * For a speculative tuple, the content of t_ctid is conflicting
+			 * between the backup page and current page. Hence, we set it
+			 * to the current block number and current offset.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int len = ItemIdGetLength(iid);
+			int padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..6e790bf 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,67 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+	OffsetNumber offnum,
+				maxoff;
+	BTPageOpaque maskopaq;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+	mask_unused_space(page_norm);
+
+	maskopaq = (BTPageOpaque) PageGetSpecialPointer(page_norm);
+
+	/*
+	 * Mask everything on a DELETED page since it will be re-initialized
+	 * during replay.
+	 */
+	if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+	{
+		/* Mask Page Content */
+		memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData);
+
+		/* Mask pd_lower and pd_upper */
+		memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+			   sizeof(uint16));
+		memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+			   sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page_norm);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page_norm, offnum);
+
+			if (ItemIdIsUsed(itemId))
+				itemId->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/* BTP_HAS_GARBAGE is just a hint bit. So, mask it. */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+
+	/*
+	 * During replay of a btree page split, we don't set the BTP_SPLIT_END
+	 * flag of the right sibling and initialize the cycle_id to 0 for the same
+	 * page.
+	 */
+	maskopaq->btpo_flags |= BTP_SPLIT_END;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index b058d49..d8fc7db 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -105,7 +105,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 									 leftChildBlkno, rightChildBlkno);
 				}
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image, apply)");
+					else
+						appendStringInfoString(buf, " (full page image)");
+				}
 				else
 				{
 					char	   *payload = XLogRecGetBlockData(record, 0, NULL);
@@ -145,7 +150,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_GIN_VACUUM_DATA_LEAF_PAGE:
 			{
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image, apply)");
+					else
+						appendStringInfoString(buf, " (full page image)");
+				}
 				else
 				{
 					ginxlogVacuumDataLeafPage *xlrec =
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..c2fdb98 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page before performing consistency checks on it.
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page	page_norm = (Page) page;
+
+	mask_page_lsn(page_norm);
+
+	mask_page_hint_bits(page_norm);
+
+	/*
+	 * Any SpGist page other than meta contains unused space which
+	 * needs to be masked.
+	 */
+	if (!SpGistPageIsMeta(page_norm))
+		mask_unused_space(page_norm);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..eae7524 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,8 +30,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f8ffa5c..935c317 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_checking_string = NULL;
+bool	   *wal_consistency_checking = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -877,6 +883,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1289,6 +1296,102 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking is applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	int		block_id;
+
+	/* records with no backup blocks have no need for consistency checks */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	/*
+	 * Leave if no masking function is defined. This can happen for resource
+	 * managers that generate only full-page writes; comparing an image to
+	 * itself has no meaning in those cases.
+	 */
+	if (RmgrTable[rmid].rm_mask == NULL)
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer	buf;
+		Page	new_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/*
+			 * WAL record doesn't contain a block reference
+			 * with the given id. Do nothing.
+			 */
+			continue;
+		}
+
+		Assert(XLogRecHasBlockImage(record, block_id));
+
+		/*
+		 * If we've just restored the block from backup image, skip
+		 * consistency check.
+		 */
+		if (XLogRecBlockImageApply(record, block_id))
+			continue;
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										  RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. There is no need to allocate
+		 * a new page here, a local buffer is fine to hold its contents and
+		 * a mask can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, old_page_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * Take a copy of the new page where WAL has been applied to have
+		 * a comparison base before masking it...
+		 */
+		memcpy(new_page_masked, new_page, BLCKSZ);
+
+		/* No need for this page anymore now that a copy has been made */
+		ReleaseBuffer(buf);
+
+		/* ... And mask both the new and old pages */
+		RmgrTable[rmid].rm_mask(new_page_masked, blkno);
+		RmgrTable[rmid].rm_mask(old_page_masked, blkno);
+
+		/* Time to compare the old and new contents */
+		if (memcmp(new_page_masked, old_page_masked, BLCKSZ) != 0)
+			elog(FATAL,
+				 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6174,6 +6277,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks; those had better
+	 * be aligned.
+	 */
+	new_page_masked = (char *) palloc(BLCKSZ);
+	old_page_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6974,6 +7084,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 24e35a3..71b587f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -421,10 +421,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -505,6 +507,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	hdr_rdt.data = hdr_scratch;
 
 	/*
+	 * Enforce consistency checks for this record if the user has asked for
+	 * them. Do this at the beginning of this routine so that callers of
+	 * XLogInsert() can also pass XLR_CHECK_CONSISTENCY directly for a
+	 * record.
+	 */
+	if (wal_consistency_checking[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	/*
 	 * Make an rdata chain containing all the data portions of all block
 	 * references. This includes the data for full-page images. Also append
 	 * the headers for the block references in the scratch buffer.
@@ -520,6 +531,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image;
 
 		if (!regbuf->in_use)
 			continue;
@@ -563,7 +575,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or wal consistency check is enabled for
+		 * current resource manager, log a full-page write for the current
+		 * block.
+		 */
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -625,6 +644,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * If WAL consistency checking is enabled for the resource manager
+			 * of this WAL record, a full-page image is included in the record
+			 * for the block modified. During redo, the full-page image is
+			 * replayed only if BKPIMAGE_APPLY is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -687,7 +715,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
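The xloginsert.c changes above boil down to two independent decisions per block: whether an image is attached at all, and whether redo should restore it. The following is a minimal standalone sketch of that decision table, not code from the patch. The names `SKETCH_IMAGE_PRESENT`, `SKETCH_IMAGE_APPLY`, `sketch_image_flags` and `sketch_replay_action` are hypothetical and only stand in for `include_image`, `BKPIMAGE_APPLY` and the redo-side checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, standing in for "has image" and BKPIMAGE_APPLY. */
#define SKETCH_IMAGE_PRESENT 0x01
#define SKETCH_IMAGE_APPLY   0x02

/* Insert side: decide which flags a block reference would carry. */
static uint8_t
sketch_image_flags(bool needs_backup, bool check_consistency)
{
	uint8_t		flags = 0;

	/* An image is attached if either reason holds. */
	if (needs_backup || check_consistency)
		flags |= SKETCH_IMAGE_PRESENT;

	/* Only a genuine backup block is restored during redo. */
	if (needs_backup)
		flags |= SKETCH_IMAGE_APPLY;

	/* APPLY without an image would be meaningless. */
	assert((flags & SKETCH_IMAGE_PRESENT) || !(flags & SKETCH_IMAGE_APPLY));
	return flags;
}

typedef enum
{
	SKETCH_RESTORE_IMAGE,		/* overwrite the page with the image */
	SKETCH_REDO_ONLY,			/* apply the record, nothing to compare */
	SKETCH_REDO_THEN_COMPARE	/* apply the record, then compare to image */
} SketchReplayAction;

/* Redo side: decide what to do with a block given its flags. */
static SketchReplayAction
sketch_replay_action(uint8_t flags)
{
	if (flags & SKETCH_IMAGE_APPLY)
		return SKETCH_RESTORE_IMAGE;
	if (flags & SKETCH_IMAGE_PRESENT)
		return SKETCH_REDO_THEN_COMPARE;
	return SKETCH_REDO_ONLY;
}
```

In this sketch a block that needs a backup is always restored and never compared, which mirrors the skip in checkConsistency() for blocks where the image was just applied.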
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 56d4c66..4be6373 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -997,6 +997,7 @@ ResetDecoder(XLogReaderState *state)
 		state->blocks[block_id].in_use = false;
 		state->blocks[block_id].has_image = false;
 		state->blocks[block_id].has_data = false;
+		state->blocks[block_id].apply_image = false;
 	}
 	state->max_block_id = -1;
 }
@@ -1089,6 +1090,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			blk = &state->blocks[block_id];
 			blk->in_use = true;
+			blk->apply_image = false;
 
 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
@@ -1120,6 +1122,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+
+				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1243,6 +1248,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
+
+		Assert(blk->has_image || !blk->apply_image);
+
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..651faf2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,9 +275,9 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
- * to protect ourselves against database pages that were partially or
+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
+ * set, we restore it, even if the page in the database appears newer.  This
+ * is to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
  * be.  This will force us to replay all subsequent modifications of the page
@@ -310,9 +310,11 @@ XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id)
  * XLogReadBufferForRedoExtended
  *		Like XLogReadBufferForRedo, but with extra options.
  *
- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.
+ *
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
  * is always BLK_NEEDS_REDO.
  *
  * (The RBM_ZERO_AND_CLEANUP_LOCK mode is redundant with the get_cleanup_lock
@@ -352,9 +354,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
+		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 668d827..275ed2d 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -33,6 +33,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1741,3 +1742,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..7c2e921
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking. Used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * mask_page_lsn
+ *
+ * In consistency checks, the LSN of the two pages compared will likely be
+ * different because of concurrent operations when the WAL is generated
+ * and the state of the page when WAL is applied.
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+}
+
+/*
+ * mask_page_hint_bits
+ *
+ * Mask hint bits in PageHeader. We want to ignore differences in hint bits,
+ * since they can be set without emitting any WAL.
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = PG_UINT32_MAX;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+}
+
+/*
+ * mask_unused_space
+ *
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int pd_lower = ((PageHeader) page)->pd_lower;
+	int pd_upper = ((PageHeader) page)->pd_upper;
+	int pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
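To illustrate the mask-then-compare idea behind bufmask.c and checkConsistency(), here is a small standalone sketch. It is not PostgreSQL code: `SketchPageHeader` is a hypothetical, heavily simplified page header, and `SKETCH_BLCKSZ`, `SKETCH_MASK_MARKER`, `sketch_mask` and `sketch_pages_consistent` are names invented for this example. It only shows that overwriting the LSN and the free space between the lower and upper offsets with a fixed marker makes a plain memcmp() ignore those areas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ      8192
#define SKETCH_MASK_MARKER 0xFF

/* Hypothetical, simplified page header; NOT PostgreSQL's PageHeaderData. */
typedef struct SketchPageHeader
{
	uint64_t	lsn;		/* differs between the image and the replayed page */
	uint32_t	flags;
	uint16_t	lower;		/* offset to start of free space */
	uint16_t	upper;		/* offset to end of free space */
} SketchPageHeader;

/* Overwrite the areas that may legitimately differ with a fixed marker. */
static void
sketch_mask(char *page)
{
	SketchPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));

	/* Same kind of sanity check as mask_unused_space() performs. */
	assert(hdr.lower >= sizeof(hdr) &&
		   hdr.lower <= hdr.upper &&
		   hdr.upper <= SKETCH_BLCKSZ);

	hdr.lsn = UINT64_MAX;
	memcpy(page, &hdr, sizeof(hdr));
	memset(page + hdr.lower, SKETCH_MASK_MARKER, hdr.upper - hdr.lower);
}

/* Compare masked copies, leaving the caller's pages untouched. */
static bool
sketch_pages_consistent(const char *replayed, const char *image)
{
	static char a[SKETCH_BLCKSZ];
	static char b[SKETCH_BLCKSZ];

	memcpy(a, replayed, SKETCH_BLCKSZ);
	memcpy(b, image, SKETCH_BLCKSZ);
	sketch_mask(a);
	sketch_mask(b);
	return memcmp(a, b, SKETCH_BLCKSZ) == 0;
}
```

Working on copies matches the approach in checkConsistency(), where the replayed buffer is copied into new_page_masked before any masking so that the shared buffer itself is never modified.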
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 946ba9e..ebbdda5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -145,6 +147,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency_checking(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency_checking(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3252,6 +3257,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency_checking", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			 NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_checking_string,
+		"",
+		check_wal_consistency_checking, assign_wal_consistency_checking, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9875,6 +9890,121 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency_checking(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		newwalconsistency[RM_MAX_ID + 1];
+	bool		isRmgrId = false;	/* Does this guc include any
+									 * individual resource manager? */
+	bool		isAll = false;		/* Does this guc include 'all' keyword? */
+
+	/* Initialize the array */
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *tok = (char *) lfirst(l);
+		bool		found = false;
+		int			i;
+
+		/* Check if the token matches with any individual resource manager */
+		for (i = 0; i <= RM_MAX_ID; i++)
+		{
+			if (pg_strcasecmp(tok, RmgrTable[i].rm_name) == 0)
+			{
+				/*
+				 * Found a match. Now, check if mask function
+				 * is defined for this resource manager. We'll enable this feature
+				 * only for the resource managers for which a masking function
+				 * is defined.
+				 */
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+					found = true;
+					isRmgrId = true;
+					break;
+				}
+				else
+				{
+					GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+					pfree(rawstring);
+					list_free(elemlist);
+					return false;
+				}
+			}
+		}
+
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		/* Definitely not an individual resource manager. Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			/*
+			 * This feature is enabled only for the resource managers where
+			 * a masking function is defined.
+			 */
+			for (i = 0; i <= RM_MAX_ID; i++)
+			{
+				if (RmgrTable[i].rm_mask != NULL)
+				{
+					newwalconsistency[i] = true;
+				}
+			}
+			isAll = true;
+		}
+		else
+		{
+			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/*
+	 * Parameter should contain either 'all' or a combination of resource
+	 * managers.
+	 */
+	if (isAll && isRmgrId)
+	{
+		GUC_check_errdetail("\"all\" cannot be combined with individual resource managers.");
+		return false;
+	}
+
+	/* assign new value */
+	*extra = guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+	memcpy(*extra, newwalconsistency, (RM_MAX_ID + 1) * sizeof(bool));
+	return true;
+}
+
+static void
+assign_wal_consistency_checking(const char *newval, void *extra)
+{
+	wal_consistency_checking = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
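The validation rules in check_wal_consistency_checking() can be summarised as: every token must name a resource manager that has a masking function, or be "all", and the two forms cannot be mixed. The following standalone sketch shows that logic under simplifying assumptions. The table `sketch_rmgrs` and the functions `sketch_name_eq` and `sketch_parse_list` are hypothetical; the real hook uses SplitIdentifierString(), RmgrTable and pg_strcasecmp() instead of strtok() and a hand-written comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_NUM_RMGRS 4

/* Hypothetical table; has_mask = false models rm_mask == NULL. */
static const struct
{
	const char *name;
	bool		has_mask;
}			sketch_rmgrs[SKETCH_NUM_RMGRS] = {
	{"xlog", false}, {"heap", true}, {"btree", true}, {"gin", true}
};

/* Case-insensitive string equality. */
static bool
sketch_name_eq(const char *a, const char *b)
{
	while (*a && *b)
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;
}

/*
 * Parse a comma-separated list into enabled[]. Returns false on invalid
 * input, in which case the contents of enabled[] are unspecified.
 */
static bool
sketch_parse_list(const char *value, bool enabled[SKETCH_NUM_RMGRS])
{
	char		buf[128];
	bool		saw_all = false;
	bool		saw_rmgr = false;
	char	   *tok;

	assert(value != NULL && enabled != NULL);
	memset(enabled, 0, SKETCH_NUM_RMGRS * sizeof(bool));
	if (strlen(value) >= sizeof(buf))
		return false;
	strcpy(buf, value);			/* strtok() needs a modifiable copy */

	for (tok = strtok(buf, ", "); tok != NULL; tok = strtok(NULL, ", "))
	{
		bool		is_all = sketch_name_eq(tok, "all");
		bool		matched = false;
		int			i;

		for (i = 0; i < SKETCH_NUM_RMGRS; i++)
		{
			/* Managers without a masking function can never be enabled. */
			if (!sketch_rmgrs[i].has_mask)
				continue;
			if (is_all || sketch_name_eq(tok, sketch_rmgrs[i].name))
			{
				enabled[i] = true;
				matched = true;
			}
		}
		if (!matched)
			return false;		/* unrecognized or unsupported key word */
		if (is_all)
			saw_all = true;
		else
			saw_rmgr = true;
	}

	/* "all" and individual names are mutually exclusive. */
	return !(saw_all && saw_rmgr);
}
```

An empty string is accepted and leaves every entry disabled, which corresponds to the default value of the setting.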
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee8232f..4e874e1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency_checking = ''		# Valid values are combinations of
+					# heap2, heap, btree, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 23ac4e7..a170d01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index d070312..48a3d48 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -465,7 +465,12 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					   rnode.spcNode, rnode.dbNode, rnode.relNode,
 					   blk);
 			if (XLogRecHasBlockImage(record, block_id))
-				printf(" FPW");
+			{
+				if (XLogRecBlockImageApply(record, block_id))
+					printf(" FPW (apply)");
+				else
+					printf(" FPW");
+			}
 		}
 		putchar('\n');
 	}
@@ -489,7 +494,10 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				if (record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
-					printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u, "
+						"compression saved: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length,
 						   BLCKSZ -
@@ -498,7 +506,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				}
 				else
 				{
-					printf(" (FPW); hole: offset: %u, length: %u\n",
+					printf(" (FPW)%s; hole: offset: %u, length: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+								" apply" : "",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length);
 				}
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..68192a7 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..8ec0eeb 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e87a6..3f8e7b7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -460,6 +460,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5cd3022 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..006922a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -775,5 +775,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..64b92ff 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..89182e2 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..fd6b9f5 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 7d21408..9d6de2f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency_checking;
+extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 05f996b..183f377 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask takes as input a page associated with the resource manager's
+ * records and masks it for consistency-check comparisons. The input must
+ * be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..697a4ef 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,7 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		apply_image; /* Restore image during WAL replay */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -205,6 +206,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	((decoder)->blocks[block_id].apply_image)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..91ee70f 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary. Note that if
+ * wal_consistency_checking is enabled for a rmgr, this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,8 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04		/* page image should be restored
+										 * during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 1fd75b2..e2733dd 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -70,5 +70,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..ab1a93c
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *	  Definitions for buffer masking routines, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0xFF
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+#endif
#107Michael Paquier
michael.paquier@gmail.com
In reply to: Kuntal Ghosh (#106)
Re: WAL consistency check facility

On Thu, Jan 5, 2017 at 2:54 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me. One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code. Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option. I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem. A bulk load into an existing table might be a good
test case.

I've done some bulk load testing with 16, 32, and 64 clients. I didn't
notice any regression in the results.

Aside from that, I think the biggest issue here is that the masking
functions are virtually free of comments, whereas I think they should
have extensive and detailed comments. For example, in heap_mask, you
have this:

+            page_htup->t_infomask =
+                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

For something like this, you could write "We want to ignore
differences in hint bits, since they can be set by SetHintBits without
emitting WAL. Force them all to be set so that we don't notice
discrepancies." Actually, though, I think that you could be a bit
more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
HEAP_XMIN_FROZEN, so maybe what you should do is clear
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
one is set but not both.
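
The suggestion above can be sketched roughly as follows. This is only an illustration, not code from the patch: mask_heap_hint_bits is a hypothetical helper, it clears bits instead of setting them, and the bit values are copied from htup_details.h as an assumption so that the fragment stands alone.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values assumed to match htup_details.h; repeated here for the sketch. */
#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400
#define HEAP_XMAX_INVALID   0x0800

/* Hypothetical helper: return the infomask with hint bits normalized. */
static uint16_t
mask_heap_hint_bits(uint16_t infomask)
{
    /* xmax hint bits can be set without emitting WAL, so always ignore them. */
    infomask &= (uint16_t) ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID);

    /*
     * Both xmin bits together mean "frozen", which is more than a hint.
     * Keep that combination; clear the xmin bits in every other case.
     */
    if ((infomask & HEAP_XMIN_FROZEN) != HEAP_XMIN_FROZEN)
        infomask &= (uint16_t) ~HEAP_XMIN_FROZEN;

    return infomask;
}
```

With this shape, a tuple that is frozen on one side and merely hinted as committed on the other would still be reported as a difference, which seems to be the point of the suggestion.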

I've modified it as follows:
+
+                       if (!HeapTupleHeaderXminFrozen(page_htup))
+                               page_htup->t_infomask |= HEAP_XACT_MASK;
+                       else
+                               page_htup->t_infomask |=
+                                       HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

Anyway, leaving that aside, I think every single change that gets
masked in every single masking routine needs a similar comment,
explaining why that change can happen on the master without also
happening on the standby and hopefully referring to the code that
makes that unlogged change.

I've added comments for all the masking routines.

I think wal_consistency_checking, as proposed by Peter, is better than
wal_consistency_check, as implemented.

Modified to wal_consistency_checking.

Having StartupXLOG() call pfree() on the masking buffers is a waste of
code. The process is going to exit anyway.

+ "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",

Done.

Primary error messages aren't capitalized.

+        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+        {
+            /* Caller specified a bogus block_id. Do nothing. */
+            continue;
+        }

Why would the caller do something so dastardly?

Modified to following comment:
+                       /*
+                        * WAL record doesn't contain a block reference
+                        * with the given id. Do nothing.
+                        */

I've attached the patch with the modified changes. PFA.

Moved to CF 2017-03 with same status, "ready for committer".
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#106)
Re: WAL consistency check facility

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

+        /*
+         * Mask some line pointer bits, particularly those marked as
+         * used on a master and unused on a standby.
+         */

Comment doesn't explain why we need to do this.

+        /*
+         * For GIN_DELETED page, the page is initialized to empty.
+         * Hence mask everything.
+         */
+        if (opaque->flags & GIN_DELETED)
+            memset(page_norm, MASK_MARKER, BLCKSZ);
+        else
+            mask_unused_space(page_norm);

If the page is initialized to empty, why do we need to mask
anything/everything? Anyway, it doesn't seem right to mask the
GinPageOpaque structure itself. Or at least it doesn't seem right to
mask the flags word.

+        /*
+         * For gist leaf pages, mask some line pointer bits, particularly
+         * those marked as used on a master and unused on a standby.
+         */

Comment doesn't explain why we need to do this.

+            if (!HeapTupleHeaderXminFrozen(page_htup))
+                page_htup->t_infomask |= HEAP_XACT_MASK;
+            else
+                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
+                    HEAP_XMAX_INVALID;

Comment doesn't address this logic. Also, the "else" case shouldn't
exist at all, I think.

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */

Why does it differ? Is that a bug?

+    /*
+     * Mask everything on a DELETED page since it will be re-initialized
+     * during replay.
+     */
+    if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+    {
+        /* Mask Page Content */
+        memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+               BLCKSZ - SizeOfPageHeaderData);
+
+        /* Mask pd_lower and pd_upper */
+        memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+               sizeof(uint16));
+        memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+               sizeof(uint16));

This isn't consistent with the GIN_DELETED case - it is more selective
about what it masks. Probably that logic should be adapted to look
more like this.

+        /*
+         * Mask some line pointer bits, particularly those marked as
+         * used on a master and unused on a standby.
+         */

Comment (still) doesn't explain why we need to do this.

+    /*
+     * During replay of a btree page split, we don't set the BTP_SPLIT_END
+     * flag of the right sibling and initialize th cycle_id to 0 for the same
+     * page.
+     */

A reference to the name of the redo function might be helpful here
(and in some other places), unless it's just ${AMNAME}_redo. "th" is
a typo for "the".

+                        appendStringInfoString(buf, " (full page image, apply)");
+                    else
+                        appendStringInfoString(buf, " (full page image)");

How about "(full page image)" and "(full page image, for WAL verification)"?

Similarly in XLogDumpDisplayRecord, I think we should assume that
"FPW" means what it has always meant, and leave that output alone.
Instead, distinguish the WAL-consistency-checking case when it
happens, e.g. "(FPW for consistency check)".
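
As a rough sketch of that labeling rule (toy_fpw_label is a hypothetical name, and the exact strings are only the ones floated in this message, not settled output):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: choose the suffix printed for a block reference.
 * An image that is applied at replay keeps the traditional "FPW" label;
 * an image carried only for verification gets a distinct label.
 */
static const char *
toy_fpw_label(bool has_image, bool apply_image)
{
    if (!has_image)
        return "";
    return apply_image ? " FPW" : " FPW for consistency check";
}
```

Existing tooling that greps for " FPW" at the end of a block reference would keep working for images that are actually restored.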

+checkConsistency(XLogReaderState *record)

How about CheckXLogConsistency()?

+ * If needs_backup is true or wal consistency check is enabled for

...or WAL checking is enabled...

+ * If WAL consistency is enabled for the resource manager of

If WAL consistency checking is enabled...

+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag

with the BKPIMAGE_APPLY flag

- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.

OK, I'm puzzled by this. Surely we don't want the WAL consistency
checking facility to cause the relation to be extended like this. And
I don't see why it would, because the WAL consistency checking happens
after the page changes have already been applied. I wonder if this
hunk is in error and should be dropped.

+    PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+    phdr->pd_prune_xid = PG_UINT32_MAX;
+    phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+    phdr->pd_flags |= PD_ALL_VISIBLE;
+#define MASK_MARKER        0xFF
(and many others)

Why do we mask by setting bits rather than clearing bits? My
intuition would have been to zero things we want to mask, rather than
setting them to one.
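
For what it's worth, the comparison itself should not care which marker is used, as long as both images are masked the same way. A toy illustration, where ToyPage and the toy_* functions are made up for this sketch and are not the real page layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical, simplified page: an 8-byte LSN followed by payload. */
typedef struct ToyPage
{
    uint64_t      lsn;
    unsigned char data[TOY_PAGE_SIZE - sizeof(uint64_t)];
} ToyPage;

/* Overwrite the LSN field with a caller-chosen marker byte. */
static void
toy_mask_lsn(ToyPage *page, unsigned char marker)
{
    assert(page != NULL);
    memset(&page->lsn, marker, sizeof(page->lsn));
}

/* Return 1 if the two pages are identical once their LSNs are masked. */
static int
toy_pages_match(const ToyPage *a, const ToyPage *b, unsigned char marker)
{
    ToyPage ca = *a;    /* work on copies, leave the inputs untouched */
    ToyPage cb = *b;

    toy_mask_lsn(&ca, marker);
    toy_mask_lsn(&cb, marker);
    return memcmp(&ca, &cb, sizeof(ToyPage)) == 0;
}
```

Passing 0x00 or 0xFF as the marker gives the same answer here, so the choice between zeroing and setting looks like a readability question rather than a correctness one.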

+                {
+                    newwalconsistency[i] = true;
+                }

Superfluous braces.

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?
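
The alternative reading can be sketched like this, with toy stand-ins rather than the real RmgrTable (toy_mask_fn, toy_heap_mask and toy_check_consistency are all hypothetical names): a NULL mask pointer simply means "compare the pages as they are".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 32

typedef void (*toy_mask_fn) (unsigned char *page);

/* Hypothetical mask routine: ignore the first byte (stand-in for hint bits). */
static void
toy_heap_mask(unsigned char *page)
{
    page[0] = 0;
}

/*
 * Compare a replayed page with the image taken from WAL.
 * A NULL mask function is treated as "nothing needs masking",
 * not as "checking is unsupported for this resource manager".
 */
static int
toy_check_consistency(toy_mask_fn mask,
                      const unsigned char *replayed,
                      const unsigned char *image)
{
    unsigned char a[TOY_BLCKSZ];
    unsigned char b[TOY_BLCKSZ];

    assert(replayed != NULL && image != NULL);
    memcpy(a, replayed, TOY_BLCKSZ);
    memcpy(b, image, TOY_BLCKSZ);

    if (mask != NULL)
    {
        mask(a);
        mask(b);
    }
    return memcmp(a, b, TOY_BLCKSZ) == 0;
}
```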

+        /* Definitely not an individual resource manager. Check for 'all'. */
+        if (pg_strcasecmp(tok, "all") == 0)

It seems like it might be cleaner to check for "all" first, and then
check for individual RMs afterward.

+    /*
+     * Parameter should contain either 'all' or a combination of resource
+     * managers.
+     */
+    if (isAll && isRmgrId)
+    {
+        GUC_check_errdetail("Invalid value combination");
+        return false;
+    }

That error message is not very clear, and I don't see why we even need
to check this. If someone sets wal_consistency_checking = hash, all,
let's just set it for all and the fact that hash is also set won't
matter to anything.
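
A minimal sketch of that parsing order, assuming a toy list of four resource manager names instead of RmgrTable and plain string scanning instead of SplitIdentifierString (everything prefixed toy_ is hypothetical):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUM_RMGRS 4

static const char *const toy_rmgr_names[TOY_NUM_RMGRS] =
{"heap", "btree", "hash", "gin"};

/* Case-insensitive comparison of a token of length len against name. */
static bool
toy_token_equals(const char *tok, size_t len, const char *name)
{
    size_t i;

    if (strlen(name) != len)
        return false;
    for (i = 0; i < len; i++)
    {
        if (tolower((unsigned char) tok[i]) != tolower((unsigned char) name[i]))
            return false;
    }
    return true;
}

/*
 * Parse a comma-separated list into enabled[]. "all" is checked first and
 * may be mixed freely with individual names. Returns false on an
 * unrecognized token.
 */
static bool
toy_parse_wal_consistency(const char *value, bool enabled[TOY_NUM_RMGRS])
{
    const char *p = value;

    assert(value != NULL);
    memset(enabled, 0, TOY_NUM_RMGRS * sizeof(bool));

    while (*p != '\0')
    {
        const char *start;
        size_t      len;
        bool        found = false;
        int         i;

        while (*p == ' ' || *p == ',')      /* skip separators */
            p++;
        start = p;
        while (*p != '\0' && *p != ',' && *p != ' ')
            p++;
        len = (size_t) (p - start);
        if (len == 0)
            break;                          /* only trailing separators left */

        if (toy_token_equals(start, len, "all"))
        {
            for (i = 0; i < TOY_NUM_RMGRS; i++)
                enabled[i] = true;
            continue;
        }

        for (i = 0; i < TOY_NUM_RMGRS; i++)
        {
            if (toy_token_equals(start, len, toy_rmgr_names[i]))
            {
                enabled[i] = true;
                found = true;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```

With this ordering a value such as "hash, all" just enables everything, and the separate isAll/isRmgrId bookkeeping is not needed.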

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

+        /*
+         * Read the contents from the current buffer and store it in a
+         * temporary page.
+         */
+        buf = XLogReadBufferExtended(rnode, forknum, blkno,
+                                          RBM_NORMAL);
+        if (!BufferIsValid(buf))
+            continue;
+
+        new_page = BufferGetPage(buf);
+
+        /*
+         * Read the contents from the backup copy, stored in WAL record
+         * and store it in a temporary page. There is not need to allocate
+         * a new page here, a local buffer is fine to hold its contents and
+         * a mask can be directly applied on it.
+         */
+        if (!RestoreBlockImage(record, block_id, old_page_masked))
+            elog(ERROR, "failed to restore block image");
+
+        /*
+         * Take a copy of the new page where WAL has been applied to have
+         * a comparison base before masking it...
+         */
+        memcpy(new_page_masked, new_page, BLCKSZ);
+
+        /* No need for this page anymore now that a copy is in */
+        ReleaseBuffer(buf);

The order of operations is strange here. We read the "new" page,
holding the pin (but no lock?). Then we restore the block image into
old_page_masked. Now we copy the new page and release the pin. It
would be better, ISTM, to rearrange that so that we finish with the
new page and release the pin before dealing with the old page. Also,
I think we need to actually lock the buffer before copying it. Maybe
that's not strictly necessary since this is all happening on the
standby, but it seems like a bad idea to set the precedent that you
can read a page without taking the content lock.
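
The ordering being proposed might look like the following toy model. ToyBuffer and the toy_* functions are invented for illustration; in the server the equivalent calls would presumably be LockBuffer() and UnlockReleaseBuffer(), but that mapping is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLCKSZ 16

/* Hypothetical stand-in for a shared buffer with a pin count and a lock. */
typedef struct ToyBuffer
{
    int           pins;
    int           locked;
    unsigned char page[TOY_BLCKSZ];
} ToyBuffer;

static void
toy_pin(ToyBuffer *b)
{
    b->pins++;
}

static void
toy_lock(ToyBuffer *b)
{
    assert(b->pins > 0 && !b->locked);
    b->locked = 1;
}

static void
toy_unlock_release(ToyBuffer *b)
{
    assert(b->locked);
    b->locked = 0;
    b->pins--;
}

/*
 * Step 1: copy the replay image under the content lock, then let go of
 *         the buffer completely.
 * Step 2: only afterwards expand the master image from the WAL record
 *         (a plain memcpy stands in for RestoreBlockImage here).
 * Step 3: compare the two private copies.
 */
static int
toy_compare_images(ToyBuffer *buf, const unsigned char *wal_image)
{
    unsigned char replay_image[TOY_BLCKSZ];
    unsigned char master_image[TOY_BLCKSZ];

    toy_pin(buf);
    toy_lock(buf);
    memcpy(replay_image, buf->page, TOY_BLCKSZ);
    toy_unlock_release(buf);

    memcpy(master_image, wal_image, TOY_BLCKSZ);

    return memcmp(replay_image, master_image, TOY_BLCKSZ) == 0;
}
```

The buffer is held only for the duration of one memcpy, and no pin or lock is outstanding while the (potentially slower) image decompression runs.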

I think the "new" and "old" page terminology is kinda weird too.
Maybe we should call them the "replay image" and the "master image" or
something like that. A few more comments wouldn't hurt either.

+     * Also mask the all-visible flag.
+     *
+     * XXX: It is unfortunate that we have to do this. If the flag is set
+     * incorrectly, that's serious, and we would like to catch it. If the flag
+     * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+     * records don't currently set the flag, even though it is set in the
+     * master, so we must silence failures that that causes.
+     */
+    phdr->pd_flags |= PD_ALL_VISIBLE;

I'm puzzled by the reference to HEAP_CLEAN. The thing that might set
the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
Unless I'm missing something, there's no situation in which
XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
always. And there's a good reason for that, which is explained in
this comment:

* We don't bump the LSN of the heap page when setting the visibility
* map bit (unless checksums or wal_hint_bits is enabled, in which
* case we must), because that would generate an unworkable volume of
* full-page writes. This exposes us to torn page hazards, but since
* we're not inspecting the existing page contents in any way, we
* don't care.
*
* However, all operations that clear the visibility map bit *do* bump
* the LSN, and those operations will only be replayed if the XLOG LSN
* follows the page LSN. Thus, if the page LSN has advanced past our
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.

So I think this comment needs to be rewritten with a bit more nuance.

+extern void mask_unused_space(Page page);
+#endif

Missing newline.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#109Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#108)
Re: WAL consistency check facility

On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

Only full pages are applied at redo by the generic WAL facility. So
you would end up comparing a page with itself in generic_redo.

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */

Why does it differ? Is that a bug?

This has been discussed twice in this thread, once by me, once by Alvaro:
/messages/by-id/CAM3SWZQC8nUgp8SjKDY3d74VLpdf9puHc7-n3zf4xcr_bghPzg@mail.gmail.com
/messages/by-id/CAB7nPqQTLGvn_XePjS07kZMMw46kS6S7LfsTocK+gLpTN0bcZw@mail.gmail.com

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?

Not all RMs work on full pages. Tracking in smgr.h the list of RMs
that are no-ops when masking things is easier than having empty
routines declared all over the code base. Don't you think so?

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

xlog_internal.h is used as well by frontends, this makes the header
dependencies cleaner. (I have looked at using Page when hacking this
stuff, but the header dependencies are not worth it, I don't recall
all the details though this was a couple of months back).
--
Michael

#110Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#108)
Re: WAL consistency check facility

On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've attached the patch with the modified changes. PFA.

Thanks, Robert, for taking the time to review this. I'll update the
patch with the changes you suggested.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#111Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#109)
Re: WAL consistency check facility

On Wed, Feb 1, 2017 at 8:01 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?

Not all RMs work on full pages. Tracking in smgr.h the list of RMs
that are no-ops when masking things is easier than having empty
routines declared all over the code base. Don't you think so?

Robert's suggestion surely makes the approach more general. But the
existing approach makes it easier to decide the RMs for which we
support the consistency check facility. Surely, we can use a list to
track the RMs which should (or should not) be masked. But I think we
always have to mask at least the LSN of the pages, don't we?

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

xlog_internal.h is used as well by frontends, this makes the header
dependencies cleaner. (I have looked at using Page when hacking this
stuff, but the header dependencies are not worth it, I don't recall
all the details though this was a couple of months back).

+1

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#112Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#109)
Re: WAL consistency check facility

On Tue, Jan 31, 2017 at 9:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

Only full pages are applied at redo by the generic WAL facility. So
you would finish by comparing a page with itself in generic_redo.

generic_redo is more complicated than just restoring an FPI. I admit
that generic_redo *probably* has no bugs, but I don't see why it would
hurt to allow it to be checked. It's not IMPOSSIBLE that restoring
the page delta could go wrong somehow.
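
To make the concern about delta replay concrete, here is a minimal standalone sketch. It assumes a delta is a list of (offset, length, bytes) fragments copied onto the page; `DemoFragment`, `demo_apply_delta` and `demo_delta_reproduces_image` are hypothetical names for illustration only and are not taken from generic_xlog.c or from the patch. The check passes only when replaying the fragments reproduces the full-page image byte for byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 32		/* hypothetical; real pages are BLCKSZ bytes */

/* Hypothetical delta fragment: "length" bytes to copy at "offset". */
typedef struct DemoFragment
{
	uint16_t	offset;
	uint16_t	length;
	const char *bytes;
} DemoFragment;

/* Replay every fragment onto the page, in order. */
static void
demo_apply_delta(char *page, const DemoFragment *frags, size_t nfrags)
{
	size_t		i;

	for (i = 0; i < nfrags; i++)
	{
		assert((size_t) frags[i].offset + frags[i].length <= DEMO_PAGE_SIZE);
		memcpy(page + frags[i].offset, frags[i].bytes, frags[i].length);
	}
}

/* True if replaying the delta on "before" yields exactly "image". */
static bool
demo_delta_reproduces_image(const char *before, const DemoFragment *frags,
							size_t nfrags, const char *image)
{
	char		replayed[DEMO_PAGE_SIZE];

	memcpy(replayed, before, DEMO_PAGE_SIZE);
	demo_apply_delta(replayed, frags, nfrags);
	return memcmp(replayed, image, DEMO_PAGE_SIZE) == 0;
}
```

A fragment recorded with an offset that is off by one would make `demo_delta_reproduces_image` return false, which is the kind of error a consistency check on this resource manager could surface.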

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */

Why does it differ? Is that a bug?

This has been discussed twice in this thread, once by me, once by Alvaro:
/messages/by-id/CAM3SWZQC8nUgp8SjKDY3d74VLpdf9puHc7-n3zf4xcr_bghPzg@mail.gmail.com
/messages/by-id/CAB7nPqQTLGvn_XePjS07kZMMw46kS6S7LfsTocK+gLpTN0bcZw@mail.gmail.com

Sorry, I missed/forgot about that. I think this is evidence that the
comment isn't really good enough. It says "hey, this might differ"
... with no real explanation of why it might differ or why that's OK.
If there were an explanation there, I wouldn't have flagged it.

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?

Not all RMs work on full pages. Tracking in smgr.h the list of RMs
that are no-ops when masking things is easier than having empty
routines declared all over the code base. Don't you think so?

Sure, but I don't think that's what the code is doing. If an RM is a
no-op when masking things, it must define an empty function. If it
defines no function, checking is disabled. I think we want to be able
to check any RM. No?

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

xlog_internal.h is used as well by frontends, this makes the header
dependencies cleaner. (I have looked at using Page when hacking this
stuff, but the header dependencies are not worth it, I don't recall
all the details though this was a couple of months back).

OK. I still think page_norm is a lame variable name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#113Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#108)
Re: WAL consistency check facility

On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

+            if (!HeapTupleHeaderXminFrozen(page_htup))
+                page_htup->t_infomask |= HEAP_XACT_MASK;
+            else
+                page_htup->t_infomask |= HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

Comment doesn't address this logic. Also, the "else" case shouldn't
exist at all, I think.

In the *if* check, it just checks frozen status of xmin, so I think
you need to mask xmax related bits in else check. Can you explain
what makes you think that the else case shouldn't exist?
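
A small standalone sketch of the two branches may help. The flag values are written from memory of htup_details.h and should be treated as assumptions, and `demo_mask_infomask` is a hypothetical helper, not a function from the patch. Like the quoted hunk, it masks by setting bits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, as I recall them from htup_details.h. */
#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400
#define HEAP_XMAX_INVALID   0x0800
#define HEAP_XACT_MASK      0xFFF0

/* Hypothetical helper: return a masked copy of a tuple's t_infomask. */
static uint16_t
demo_mask_infomask(uint16_t infomask)
{
	int			xmin_frozen;

	/* Sanity check: the visibility mask must cover the frozen bits. */
	assert((HEAP_XACT_MASK & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN);

	xmin_frozen = (infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN;

	if (!xmin_frozen)
	{
		/* xmin not frozen: hide every visibility-related bit. */
		infomask = (uint16_t) (infomask | HEAP_XACT_MASK);
	}
	else
	{
		/* xmin frozen: keep that fact, but still hide the xmax hint bits. */
		infomask = (uint16_t) (infomask |
							   HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID);
	}

	return infomask;
}
```

Without the else branch, two frozen tuples that differ only in an xmax hint bit (for example 0x0B00 versus 0x0300) would still compare as different after masking.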

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#114Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#113)
Re: WAL consistency check facility

On Tue, Feb 7, 2017 at 6:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

+            if (!HeapTupleHeaderXminFrozen(page_htup))
+                page_htup->t_infomask |= HEAP_XACT_MASK;
+            else
+                page_htup->t_infomask |= HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

Comment doesn't address this logic. Also, the "else" case shouldn't
exist at all, I think.

In the *if* check, it just checks frozen status of xmin, so I think
you need to mask xmax related bits in else check. Can you explain
what makes you think that the else case shouldn't exist?

Oh, you're right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#115Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#108)
1 attachment(s)
Re: WAL consistency check facility

On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

Added support for generic resource manager type.

+        /*
+         * Mask some line pointer bits, particularly those marked as
+         * used on a master and unused on a standby.
+         */

Comment doesn't explain why we need to do this.

Added comments.

+        /*
+         * For GIN_DELETED page, the page is initialized to empty.
+         * Hence mask everything.
+         */
+        if (opaque->flags & GIN_DELETED)
+            memset(page_norm, MASK_MARKER, BLCKSZ);
+        else
+            mask_unused_space(page_norm);

If the page is initialized to empty, why do we need to mask
anything/everything? Anyway, it doesn't seem right to mask the
GinPageOpaque structure itself. Or at least it doesn't seem right to
mask the flags word.

Modified accordingly.

+            if (!HeapTupleHeaderXminFrozen(page_htup))
+                page_htup->t_infomask |= HEAP_XACT_MASK;
+            else
+                page_htup->t_infomask |= HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

Comment doesn't address this logic. Also, the "else" case shouldn't
exist at all, I think.

Added comments. I think that "Else" part is needed for xmax.

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */

Why does it differ? Is that a bug?

Added comments.

+    /*
+     * During replay of a btree page split, we don't set the BTP_SPLIT_END
+     * flag of the right sibling and initialize th cycle_id to 0 for the same
+     * page.
+     */

A reference to the name of the redo function might be helpful here
(and in some other places), unless it's just ${AMNAME}_redo. "th" is
a typo for "the".

Corrected.

+                        appendStringInfoString(buf, " (full page image, apply)");
+                    else
+                        appendStringInfoString(buf, " (full page image)");

How about "(full page image)" and "(full page image, for WAL verification)"?

Similarly in XLogDumpDisplayRecord, I think we should assume that
"FPW" means what it has always meant, and leave that output alone.
Instead, distinguish the WAL-consistency-checking case when it
happens, e.g. "(FPW for consistency check)".

Corrected.

+checkConsistency(XLogReaderState *record)

How about CheckXLogConsistency()?

+ * If needs_backup is true or wal consistency check is enabled for

...or WAL checking is enabled...

+ * If WAL consistency is enabled for the resource manager of

If WAL consistency checking is enabled...

+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag

with the BKPIMAGE_APPLY flag

Modified accordingly.

- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.

OK, I'm puzzled by this. Surely we don't want the WAL consistency
checking facility to cause the relation to be extended like this. And
I don't see why it would, because the WAL consistency checking happens
after the page changes have already been applied. I wonder if this
hunk is in error and should be dropped.

Dropped comments.

+    PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+    phdr->pd_prune_xid = PG_UINT32_MAX;
+    phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+    phdr->pd_flags |= PD_ALL_VISIBLE;
+#define MASK_MARKER        0xFF
(and many others)

Why do we mask by setting bits rather than clearing bits? My
intuition would have been to zero things we want to mask, rather than
setting them to one.

Modified accordingly so that we can zero things during masking instead
of setting them to one.
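
As an illustration of the zeroing convention, here is a minimal standalone sketch. `DemoPage`, `DEMO_HINT_FLAGS` and both functions are hypothetical simplifications, not the page layout or the bufmask.c API from the patch. Fields that may legitimately differ are cleared in copies of both images before a byte-wise comparison.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE  64		/* hypothetical payload size */
#define MASK_MARKER     0		/* masked bytes are zeroed, not set to 0xFF */
#define DEMO_HINT_FLAGS 0x0007	/* hypothetical hint bits in "flags" */

/* Hypothetical, heavily simplified page; the layout has no padding holes. */
typedef struct DemoPage
{
	uint64_t	lsn;			/* may differ between primary and standby */
	uint16_t	flags;			/* contains hint bits */
	uint16_t	lower;			/* end of used space at the front of data[] */
	uint16_t	upper;			/* start of used space at the back of data[] */
	uint16_t	reserved;
	uint8_t		data[DEMO_PAGE_SIZE];
} DemoPage;

/* Clear everything that is allowed to differ. */
static void
demo_mask_page(DemoPage *page)
{
	assert(page->lower <= page->upper && page->upper <= DEMO_PAGE_SIZE);

	page->lsn = MASK_MARKER;
	page->flags &= (uint16_t) ~DEMO_HINT_FLAGS;
	memset(page->data + page->lower, MASK_MARKER, page->upper - page->lower);
}

/* Mask private copies of both images, then compare them byte for byte. */
static int
demo_pages_match(const DemoPage *replayed, const DemoPage *image)
{
	DemoPage	a = *replayed;
	DemoPage	b = *image;

	demo_mask_page(&a);
	demo_mask_page(&b);
	return memcmp(&a, &b, sizeof(DemoPage)) == 0;
}
```

Either convention works as long as both images are masked the same way; zeroing simply means a masked page dump shows zeros wherever content was ignored.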

+                {
+                    newwalconsistency[i] = true;
+                }

Superfluous braces.

Corrected.

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?

I've introduced a function in rmgr.c, named
consistencyCheck_is_enabled, which returns true if
wal_consistency_checking is enabled for a resource manager. It does
not have any dependency on the masking function. If a masking function
is defined, then we mask the page before consistency check. However,
I'm not sure whether rmgr.c is the right place to define the function
consistencyCheck_is_enabled.
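
The decoupling described above can be sketched as follows. This is a standalone illustration under stated assumptions: `DemoRmgr`, `demo_checking_enabled` and `demo_check_consistency` are hypothetical stand-ins, not the patch's RmgrTable, GUC state or check routine. Whether a resource manager is checked depends only on the enabled flag; a NULL mask callback just means the pages are compared unmasked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16
#define DEMO_RM_COUNT  2

typedef void (*demo_mask_fn) (char *page);

/* Hypothetical resource manager entry. */
typedef struct DemoRmgr
{
	const char *name;
	demo_mask_fn mask;			/* NULL means "nothing needs masking" */
} DemoRmgr;

/* Hypothetical mask callback: the first byte is allowed to differ. */
static void
demo_mask_first_byte(char *page)
{
	page[0] = 0;
}

static const DemoRmgr demo_rmgr_table[DEMO_RM_COUNT] = {
	{"rm_with_mask", demo_mask_first_byte},
	{"rm_without_mask", NULL},
};

/* Stands in for the per-resource-manager state of the setting. */
static bool demo_checking_enabled[DEMO_RM_COUNT];

/* Returns true if the check passes or is not enabled for this rmid. */
static bool
demo_check_consistency(int rmid, const char *replayed, const char *image)
{
	char		a[DEMO_PAGE_SIZE];
	char		b[DEMO_PAGE_SIZE];

	assert(rmid >= 0 && rmid < DEMO_RM_COUNT);

	if (!demo_checking_enabled[rmid])
		return true;

	memcpy(a, replayed, DEMO_PAGE_SIZE);
	memcpy(b, image, DEMO_PAGE_SIZE);

	/* Masking is optional; the comparison itself is not. */
	if (demo_rmgr_table[rmid].mask != NULL)
	{
		demo_rmgr_table[rmid].mask(a);
		demo_rmgr_table[rmid].mask(b);
	}

	return memcmp(a, b, DEMO_PAGE_SIZE) == 0;
}
```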

+        /* Definitely not an individual resource manager. Check for 'all'. */
+        if (pg_strcasecmp(tok, "all") == 0)

It seems like it might be cleaner to check for "all" first, and then
check for individual RMs afterward.

Done.

+    /*
+     * Parameter should contain either 'all' or a combination of resource
+     * managers.
+     */
+    if (isAll && isRmgrId)
+    {
+        GUC_check_errdetail("Invalid value combination");
+        return false;
+    }

That error message is not very clear, and I don't see why we even need
to check this. If someone sets wal_consistency_checking = hash, all,
let's just set it for all and the fact that hash is also set won't
matter to anything.

Modified accordingly.
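
A minimal standalone sketch of the parsing behaviour agreed on here: "all" is checked first for each token, and mixing it with individual names is harmless. The resource manager list and the function names are hypothetical; the real check hook in guc.c is not reproduced.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_RM_COUNT 3

/* Hypothetical subset of resource manager names. */
static const char *const demo_rm_names[DEMO_RM_COUNT] = {"heap", "btree", "hash"};

/* Case-insensitive comparison of the token [s, s + len) against name. */
static bool
demo_token_equals(const char *s, size_t len, const char *name)
{
	size_t		i;

	if (strlen(name) != len)
		return false;
	for (i = 0; i < len; i++)
	{
		if (tolower((unsigned char) s[i]) != tolower((unsigned char) name[i]))
			return false;
	}
	return true;
}

/*
 * Parse a comma-separated list into enabled[].  Returns false, leaving
 * enabled[] untouched, if any token is neither "all" nor a known name.
 */
static bool
demo_parse_checking_list(const char *value, bool enabled[DEMO_RM_COUNT])
{
	bool		result[DEMO_RM_COUNT] = {false};
	const char *p = value;

	assert(value != NULL);

	while (*p != '\0')
	{
		size_t		len;
		int			i;
		bool		found = false;

		p += strspn(p, ", \t");		/* skip separators */
		len = strcspn(p, ", \t");	/* length of the next token */
		if (len == 0)
			break;

		if (demo_token_equals(p, len, "all"))
		{
			/* "all" first: it simply turns everything on. */
			for (i = 0; i < DEMO_RM_COUNT; i++)
				result[i] = true;
			found = true;
		}
		else
		{
			for (i = 0; i < DEMO_RM_COUNT; i++)
			{
				if (demo_token_equals(p, len, demo_rm_names[i]))
				{
					result[i] = true;
					found = true;
					break;
				}
			}
		}

		if (!found)
			return false;
		p += len;
	}

	memcpy(enabled, result, sizeof(result));
	return true;
}
```

With this shape, "hash, all" enables every resource manager and no "invalid combination" error is needed; only an unknown name is rejected.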

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

Modified it as "Page tempPage = (Page) page;"

+        /*
+         * Read the contents from the current buffer and store it in a
+         * temporary page.
+         */
+        buf = XLogReadBufferExtended(rnode, forknum, blkno,
+                                          RBM_NORMAL);
+        if (!BufferIsValid(buf))
+            continue;
+
+        new_page = BufferGetPage(buf);
+
+        /*
+         * Read the contents from the backup copy, stored in WAL record
+         * and store it in a temporary page. There is no need to allocate
+         * a new page here, a local buffer is fine to hold its contents and
+         * a mask can be directly applied on it.
+         */
+        if (!RestoreBlockImage(record, block_id, old_page_masked))
+            elog(ERROR, "failed to restore block image");
+
+        /*
+         * Take a copy of the new page where WAL has been applied to have
+         * a comparison base before masking it...
+         */
+        memcpy(new_page_masked, new_page, BLCKSZ);
+
+        /* No need for this page anymore now that a copy is in */
+        ReleaseBuffer(buf);

The order of operations is strange here. We read the "new" page,
holding the pin (but no lock?). Then we restore the block image into
old_page_masked. Now we copy the new page and release the pin. It
would be better, ISTM, to rearrange that so that we finish with the
new page and release the pin before dealing with the old page. Also,
I think we need to actually lock the buffer before copying it. Maybe
that's not strictly necessary since this is all happening on the
standby, but it seems like a bad idea to set the precedent that you
can read a page without taking the content lock.

Modified accordingly.
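
The requested ordering can be sketched with a toy buffer. `DemoBuffer` and the `demo_*` functions are hypothetical; in PostgreSQL the locking steps would presumably map to calls such as LockBuffer() and UnlockReleaseBuffer(), but the revised hunk is not shown in this message, so that mapping is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_PAGE_SIZE 32

/* Hypothetical shared buffer with a pin count and a content lock. */
typedef struct DemoBuffer
{
	char		page[DEMO_PAGE_SIZE];
	int			pin_count;
	bool		content_locked;
} DemoBuffer;

static void
demo_pin(DemoBuffer *buf)
{
	buf->pin_count++;
}

static void
demo_lock(DemoBuffer *buf)
{
	assert(buf->pin_count > 0 && !buf->content_locked);
	buf->content_locked = true;
}

static void
demo_unlock_release(DemoBuffer *buf)
{
	assert(buf->content_locked && buf->pin_count > 0);
	buf->content_locked = false;
	buf->pin_count--;
}

/* Compare the replayed page with the image carried in the WAL record. */
static bool
demo_compare_after_redo(DemoBuffer *buf, const char *wal_image)
{
	char		replay_image[DEMO_PAGE_SIZE];
	char		master_image[DEMO_PAGE_SIZE];

	/* 1. Pin and lock the buffer, copy it, and let go of it right away. */
	demo_pin(buf);
	demo_lock(buf);
	memcpy(replay_image, buf->page, DEMO_PAGE_SIZE);
	demo_unlock_release(buf);

	/* 2. Only now deal with the image restored from the WAL record. */
	memcpy(master_image, wal_image, DEMO_PAGE_SIZE);

	/* 3. Masking of both copies would happen here, then the comparison. */
	return memcmp(replay_image, master_image, DEMO_PAGE_SIZE) == 0;
}
```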

I think the "new" and "old" page terminology is kinda weird too.
Maybe we should call them the "replay image" and the "master image" or
something like that. A few more comments wouldn't hurt either.

Done.

+     * Also mask the all-visible flag.
+     *
+     * XXX: It is unfortunate that we have to do this. If the flag is set
+     * incorrectly, that's serious, and we would like to catch it. If the flag
+     * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+     * records don't currently set the flag, even though it is set in the
+     * master, so we must silence failures that that causes.
+     */
+    phdr->pd_flags |= PD_ALL_VISIBLE;

I'm puzzled by the reference to HEAP_CLEAN. The thing that might set
the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
Unless I'm missing something, there's no situation in which
XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
always. And there's a good reason for that, which is explained in
this comment:

* We don't bump the LSN of the heap page when setting the visibility
* map bit (unless checksums or wal_hint_bits is enabled, in which
* case we must), because that would generate an unworkable volume of
* full-page writes. This exposes us to torn page hazards, but since
* we're not inspecting the existing page contents in any way, we
* don't care.
*
* However, all operations that clear the visibility map bit *do* bump
* the LSN, and those operations will only be replayed if the XLOG LSN
* follows the page LSN. Thus, if the page LSN has advanced past our
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.

So I think this comment needs to be rewritten with a bit more nuance.

Corrected.
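
The LSN interlock described in the quoted comment can be illustrated with a toy model. `DemoHeapPage` and both redo functions are hypothetical and much simpler than the real heap redo routines; the sketch only assumes that a record is skipped when the page LSN has already reached the record's LSN, and that setting the flag does not bump the page LSN while clearing it does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heap page: just an LSN and the all-visible flag. */
typedef struct DemoHeapPage
{
	uint64_t	lsn;
	bool		all_visible;
} DemoHeapPage;

/* Replay "set all-visible"; deliberately leaves the page LSN alone. */
static void
demo_redo_visible(DemoHeapPage *page, uint64_t record_lsn)
{
	assert(page != NULL);

	/*
	 * A later change is already on the page.  Setting the flag now would be
	 * wrong, because the record that cleared it will be skipped as well.
	 */
	if (page->lsn >= record_lsn)
		return;

	page->all_visible = true;
}

/* Replay an update that clears the flag and bumps the page LSN. */
static void
demo_redo_clear(DemoHeapPage *page, uint64_t record_lsn)
{
	assert(page != NULL);

	if (page->lsn >= record_lsn)
		return;					/* already applied */

	page->all_visible = false;
	page->lsn = record_lsn;
}
```

For a page that reached disk with LSN 300, replaying "visible" at LSN 200 and then "clear" at LSN 300 skips both records, so the flag stays clear. On the primary the flag may have been set in between, which is one way the standby and the full-page image can legitimately disagree on this bit.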

+extern void mask_unused_space(Page page);
+#endif

Missing newline.

Done.

Thank you Robert for the review. Please find the updated patch in the
attachment.

Thanks to Amit Kapila and Dilip Kumar for their suggestions in offline
discussions.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

walconsistency_v17.patch (application/x-download)
From 6a337bd256b6c65051cc29ac321e85c629470d9a Mon Sep 17 00:00:00 2001
From: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Date: Wed, 8 Feb 2017 11:14:16 +0530
Subject: [PATCH] This patch implements a parameter, wal_consistency_checking,
 which is used to check the consistency of WAL records, i.e., whether the WAL
 records are inserted and applied correctly.

---
 doc/src/sgml/config.sgml                      |  29 ++++++
 src/backend/access/brin/brin_xlog.c           |  20 +++++
 src/backend/access/gin/ginxlog.c              |  32 +++++++
 src/backend/access/gist/gistxlog.c            |  43 +++++++++
 src/backend/access/heap/heapam.c              |  79 ++++++++++++++++
 src/backend/access/nbtree/nbtxlog.c           |  50 +++++++++++
 src/backend/access/rmgrdesc/gindesc.c         |  14 ++-
 src/backend/access/spgist/spgxlog.c           |  21 +++++
 src/backend/access/transam/generic_xlog.c     |  12 +++
 src/backend/access/transam/rmgr.c             |  30 ++++++-
 src/backend/access/transam/xlog.c             | 114 +++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c       |  38 ++++++--
 src/backend/access/transam/xlogreader.c       |   8 ++
 src/backend/access/transam/xlogutils.c        |  11 +--
 src/backend/commands/sequence.c               |  12 +++
 src/backend/storage/buffer/Makefile           |   2 +-
 src/backend/storage/buffer/bufmask.c          | 125 ++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c                  | 115 ++++++++++++++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/bin/pg_rewind/parsexlog.c                 |   2 +-
 src/bin/pg_xlogdump/pg_xlogdump.c             |  16 +++-
 src/bin/pg_xlogdump/rmgrdesc.c                |   2 +-
 src/include/access/brin_xlog.h                |   1 +
 src/include/access/generic_xlog.h             |   1 +
 src/include/access/gin.h                      |   1 +
 src/include/access/gist_private.h             |   1 +
 src/include/access/heapam_xlog.h              |   1 +
 src/include/access/nbtree.h                   |   1 +
 src/include/access/rmgr.h                     |   4 +-
 src/include/access/rmgrlist.h                 |  44 ++++-----
 src/include/access/spgist.h                   |   1 +
 src/include/access/xlog.h                     |   2 +
 src/include/access/xlog_internal.h            |   5 ++
 src/include/access/xlogreader.h               |   3 +
 src/include/access/xlogrecord.h               |  14 ++-
 src/include/commands/sequence.h               |   1 +
 src/include/storage/bufmask.h                 |  30 +++++++
 37 files changed, 844 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fb5d647..422c887 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2515,6 +2515,35 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-consistency-checking" xreflabel="wal_consistency_checking">
+      <term><varname>wal_consistency_checking</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>wal_consistency_checking</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter is used to check the consistency of WAL records, i.e.,
+        whether the WAL records are inserted and applied correctly. When
+        <varname>wal_consistency_checking</varname> is enabled for a WAL record, it
+        stores a full-page image along with the record. When a full-page image
+        arrives during redo, it compares against the current page to check whether
+        both are consistent.
+       </para>
+
+       <para>
+        By default, this setting does not contain any value. To check
+        all records written to the write-ahead log, set this parameter to
+        <literal>all</literal>. To check only some records, specify a
+        comma-separated list of resource managers. The resource managers
+        which are currently supported are <literal>heap2</>, <literal>heap</>,
+        <literal>btree</>, <literal>generic</>, <literal>gin</>, <literal>gist</>,
+        <literal>spgist</>, <literal>sequence</> and <literal>brin</>. Only
+        superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index b698c9b..02a526d 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 
 
 /*
@@ -279,3 +280,22 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a BRIN page before doing consistency checks.
+ */
+void
+brin_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+
+	mask_page_lsn(tempPage);
+
+	mask_page_hint_bits(tempPage);
+
+	if (BRIN_IS_REGULAR_PAGE(tempPage))
+	{
+		/* Regular brin pages contain unused space which needs to be masked. */
+		mask_unused_space(tempPage);
+	}
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 8468fe8..025441a 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,34 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a GIN page before running consistency checks on it.
+ */
+void
+gin_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+	GinPageOpaque opaque;
+
+	mask_page_lsn(tempPage);
+	opaque = GinPageGetOpaque(tempPage);
+
+	mask_page_hint_bits(tempPage);
+
+	/*
+	 * GIN metapage doesn't use pd_lower/pd_upper. Other page types do. Hence,
+	 * we need to apply masking for those pages.
+	 */
+	if (opaque->flags != GIN_META)
+	{
+		/*
+		 * For GIN_DELETED page, the page is initialized to empty. Hence, mask
+		 * the page content.
+		 */
+		if (opaque->flags & GIN_DELETED)
+			mask_page_content(tempPage);
+		else
+			mask_unused_space(tempPage);
+	}
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 88b97a4..4adc07b 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -343,6 +344,48 @@ gist_xlog_cleanup(void)
 }
 
 /*
+ * Mask a Gist page before running consistency checks on it.
+ */
+void
+gist_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+
+	mask_page_lsn(tempPage);
+
+	mask_page_hint_bits(tempPage);
+	mask_unused_space(tempPage);
+
+	/*
+	 * NSN is nothing but a special purpose LSN. Hence, mask it for the same
+	 * reason as mask_page_lsn.
+	 */
+	GistPageSetNSN(tempPage, (uint64) MASK_MARKER);
+
+	/*
+	 * We update F_FOLLOW_RIGHT flag on the left child after writing WAL
+	 * record. Hence, mask this flag. See gistplacetopage() for details.
+	 */
+	GistMarkFollowRight(tempPage);
+
+	if (GistPageIsLeaf(tempPage))
+	{
+		/*
+		 * In gist leaf pages, it is possible to modify the LP_FLAGS without
+		 * emitting any WAL record. Hence, mask the line pointer flags. See
+		 * gistkillitems() for details.
+		 */
+		mask_lp_flags(tempPage);
+	}
+
+	/*
+	 * During gist redo, we never mark a page as garbage. Hence, mask it to
+	 * ignore any differences.
+	 */
+	GistClearPageHasGarbage(tempPage);
+}
+
+/*
  * Write WAL record of a page split.
  */
 XLogRecPtr
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5fd7f1e..3b59b97 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/bufmask.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -9142,3 +9143,81 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * Mask a heap page before performing consistency checks on it.
+ */
+void
+heap_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+	OffsetNumber off;
+
+	mask_page_lsn(tempPage);
+
+	mask_page_hint_bits(tempPage);
+	mask_unused_space(tempPage);
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(tempPage); off++)
+	{
+		ItemId		iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (tempPage + ItemIdGetOffset(iid));
+
+		if (ItemIdIsNormal(iid))
+		{
+
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			/*
+			 * If xmin of a tuple is not yet frozen, we should ignore
+			 * differences in hint bits, since they can be set without
+			 * emitting WAL.
+			 */
+			if (!HeapTupleHeaderXminFrozen(page_htup))
+				page_htup->t_infomask &= ~HEAP_XACT_MASK;
+			else
+			{
+				/* Still we need to mask xmax hint bits. */
+				page_htup->t_infomask &= ~HEAP_XMAX_INVALID;
+				page_htup->t_infomask &= ~HEAP_XMAX_COMMITTED;
+			}
+
+			/*
+			 * During replay, we set Command Id to FirstCommandId. Hence, mask
+			 * it. See heap_xlog_insert() for details.
+			 */
+			page_htup->t_choice.t_heap.t_field3.t_cid = MASK_MARKER;
+
+			/*
+			 * For a speculative tuple, heap_insert() does not set ctid in the
+			 * caller-passed heap tuple itself, leaving the ctid field to
+			 * contain a speculative token value - a per-backend monotonically
+			 * increasing identifier. Besides, it does not WAL-log ctid under
+			 * any circumstances.
+			 *
+			 * During redo, heap_xlog_insert() sets t_ctid to current block
+			 * number and self offset number. It doesn't care about any
+			 * speculative insertions in master. Hence, we set t_ctid to
+			 * current block number and self offset number to ignore any
+			 * inconsistency.
+			 */
+			if (HeapTupleHeaderIsSpeculative(page_htup))
+				ItemPointerSet(&page_htup->t_ctid, blkno, off);
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of the
+		 * item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int			len = ItemIdGetLength(iid);
+			int			padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index efad745..b39afc6 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,52 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * Mask a btree page before performing consistency checks on it.
+ */
+void
+btree_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+	BTPageOpaque maskopaq;
+
+	mask_page_lsn(tempPage);
+
+	mask_page_hint_bits(tempPage);
+	mask_unused_space(tempPage);
+
+	maskopaq = (BTPageOpaque) PageGetSpecialPointer(tempPage);
+
+	if (P_ISDELETED(maskopaq))
+	{
+		/*
+		 * Mask page content on a DELETED page since it will be re-initialized
+		 * during replay. See btree_xlog_unlink_page() for details.
+		 */
+		mask_page_content(tempPage);
+	}
+	else if (P_ISLEAF(maskopaq))
+	{
+		/*
+		 * In btree leaf pages, it is possible to modify the LP_FLAGS without
+		 * emitting any WAL record. Hence, mask the line pointer flags. See
+		 * _bt_killitems(), _bt_check_unique() for details.
+		 */
+		mask_lp_flags(tempPage);
+	}
+
+	/*
+	 * BTP_HAS_GARBAGE is just an un-logged hint bit. So, mask it. See
+	 * _bt_killitems(), _bt_check_unique() for details.
+	 */
+	maskopaq->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/*
+	 * During replay of a btree page split, we don't set the BTP_SPLIT_END
+	 * flag of the right sibling and initialize the cycle_id to 0 for the same
+	 * page. See btree_xlog_split() for details.
+	 */
+	maskopaq->btpo_flags &= ~BTP_SPLIT_END;
+	maskopaq->btpo_cycleid = 0;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index 9e488b3..d4ed7f9 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -105,7 +105,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 									 leftChildBlkno, rightChildBlkno);
 				}
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image)");
+					else
+						appendStringInfoString(buf, " (full page image, for WAL verification)");
+				}
 				else
 				{
 					char	   *payload = XLogRecGetBlockData(record, 0, NULL);
@@ -145,7 +150,12 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_GIN_VACUUM_DATA_LEAF_PAGE:
 			{
 				if (XLogRecHasBlockImage(record, 0))
-					appendStringInfoString(buf, " (full page image)");
+				{
+					if (XLogRecBlockImageApply(record, 0))
+						appendStringInfoString(buf, " (full page image)");
+					else
+						appendStringInfoString(buf, " (full page image, for WAL verification)");
+				}
 				else
 				{
 					ginxlogVacuumDataLeafPage *xlrec =
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 3dc6a5a..52e7b46 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,23 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * Mask a SpGist page before performing consistency checks on it.
+ */
+void
+spg_mask(char *page, BlockNumber blkno)
+{
+	Page		tempPage = (Page) page;
+
+	mask_page_lsn(tempPage);
+
+	mask_page_hint_bits(tempPage);
+
+	/*
+	 * Any SpGist page other than meta contains unused space which needs to be
+	 * masked.
+	 */
+	if (!SpGistPageIsMeta(tempPage))
+		mask_unused_space(tempPage);
+}
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index eddec9b..0c0700b 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -16,6 +16,7 @@
 #include "access/generic_xlog.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 /*-------------------------------------------------------------------------
@@ -533,3 +534,14 @@ generic_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffers[block_id]);
 	}
 }
+
+/*
+ * Mask a generic page before performing consistency checks on it.
+ */
+void
+generic_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..a0fa4c4 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,9 +30,35 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
+	{ name, redo, desc, identify, startup, cleanup, mask },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
 };
+
+/*
+ * consistencyCheck_is_enabled
+ *
+ * Returns true if consistency checking is enabled for the given rmid,
+ * false otherwise.
+ */
+bool
+consistencyCheck_is_enabled(RmgrId rmid)
+{
+	switch (rmid)
+	{
+		case RM_HEAP2_ID:
+		case RM_HEAP_ID:
+		case RM_BTREE_ID:
+		case RM_GIN_ID:
+		case RM_GIST_ID:
+		case RM_SEQ_ID:
+		case RM_SPGIST_ID:
+		case RM_BRIN_ID:
+		case RM_GENERIC_ID:
+			return true;
+		default:
+			return false;
+	}
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2f5d603..ea5fd3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -95,6 +95,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_checking_string = NULL;
+bool	   *wal_consistency_checking = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -245,6 +247,10 @@ bool		InArchiveRecovery = false;
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
+/* Buffers dedicated to consistency checks of size BLCKSZ */
+static char *replay_image_masked = NULL;
+static char *master_image_masked = NULL;
+
 /* options taken from recovery.conf for archive recovery */
 char	   *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
@@ -903,6 +909,7 @@ static char *GetXLogBuffer(XLogRecPtr ptr);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+static void checkXLogConsistency(XLogReaderState *record);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1315,6 +1322,97 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 }
 
 /*
+ * Checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, a
+ * masking can be applied to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper among other things. This
+ * function should be called once WAL replay has been completed for a
+ * given record.
+ */
+static void
+checkXLogConsistency(XLogReaderState *record)
+{
+	RmgrId		rmid = XLogRecGetRmid(record);
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			block_id;
+
+	/* Records with no backup blocks have no need for consistency checks. */
+	if (!XLogRecHasAnyBlockRefs(record))
+		return;
+
+	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
+	Assert(consistencyCheck_is_enabled(rmid));
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		Page		page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/*
+			 * WAL record doesn't contain a block reference with the given id.
+			 * Do nothing.
+			 */
+			continue;
+		}
+
+		Assert(XLogRecHasBlockImage(record, block_id));
+
+		/*
+		 * Read the contents from the current buffer and store it in a
+		 * temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buf);
+
+		/*
+		 * Take a copy of the local page where WAL has been applied to have a
+		 * comparison base before masking it...
+		 */
+		memcpy(replay_image_masked, page, BLCKSZ);
+
+		/* No need for this page anymore now that a copy is in. */
+		UnlockReleaseBuffer(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record and
+		 * store it in a temporary page. There is no need to allocate a new
+		 * page here, a local buffer is fine to hold its contents and a mask
+		 * can be directly applied on it.
+		 */
+		if (!RestoreBlockImage(record, block_id, master_image_masked))
+			elog(ERROR, "failed to restore block image");
+
+		/*
+		 * If masking function is defined, mask both the master and replay
+		 * images.
+		 */
+		if (RmgrTable[rmid].rm_mask != NULL)
+		{
+			RmgrTable[rmid].rm_mask(replay_image_masked, blkno);
+			RmgrTable[rmid].rm_mask(master_image_masked, blkno);
+		}
+
+		/* Time to compare the master and replay images. */
+		if (memcmp(replay_image_masked, master_image_masked, BLCKSZ) != 0)
+		{
+			elog(FATAL,
+			   "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		}
+	}
+}
+
+/*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
  */
@@ -6200,6 +6298,13 @@ StartupXLOG(void)
 		   errdetail("Failed while allocating an XLog reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Allocate pages dedicated to WAL consistency checks; those had better
+	 * be aligned.
+	 */
+	replay_image_masked = (char *) palloc(BLCKSZ);
+	master_image_masked = (char *) palloc(BLCKSZ);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -7000,6 +7105,15 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with
+				 * the WAL record are consistent with the existing pages. This
+				 * check is done only if consistency check is enabled for this
+				 * record.
+				 */
+				if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
+					checkXLogConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a5aa58d..797e68c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -421,10 +421,12 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
-	 * reserved for use by me.
+	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
+	if ((info & ~(XLR_RMGR_INFO_MASK |
+				  XLR_SPECIAL_REL_UPDATE |
+				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
@@ -505,6 +507,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	hdr_rdt.data = hdr_scratch;
 
 	/*
+	 * Enforce consistency checks for this record if user is looking for
+	 * it. Do this at the beginning of this routine to give the
+	 * possibility for callers of XLogInsert() to pass XLR_CHECK_CONSISTENCY
+	 * directly for a record.
+	 */
+	if (wal_consistency_checking[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	/*
 	 * Make an rdata chain containing all the data portions of all block
 	 * references. This includes the data for full-page images. Also append
 	 * the headers for the block references in the scratch buffer.
@@ -520,6 +531,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		XLogRecordBlockCompressHeader cbimg = {0};
 		bool		samerel;
 		bool		is_compressed = false;
+		bool		include_image;
 
 		if (!regbuf->in_use)
 			continue;
@@ -563,7 +575,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
 			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
 
-		if (needs_backup)
+		/*
+		 * If needs_backup is true or WAL checking is enabled for
+		 * current resource manager, log a full-page write for the current
+		 * block.
+		 */
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+
+		if (include_image)
 		{
 			Page		page = regbuf->page;
 			uint16		compressed_len;
@@ -625,6 +644,15 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
 
+			/*
+			 * If WAL consistency checking is enabled for the resource manager of
+			 * this WAL record, a full-page image is included in the record
+			 * for the block modified. During redo, the full-page image is
+			 * restored only if BKPIMAGE_APPLY is set.
+			 */
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
 			if (is_compressed)
 			{
 				bimg.length = compressed_len;
@@ -687,7 +715,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		/* Ok, copy the header to the scratch buffer */
 		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
 		scratch += SizeOfXLogRecordBlockHeader;
-		if (needs_backup)
+		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
 			scratch += SizeOfXLogRecordBlockImageHeader;
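Reviewer note on the xloginsert.c changes above: the insert side decides two separate things per block, whether to include an image at all and whether redo should restore it. The following is a minimal standalone sketch of that decision, not code from the patch. The Toy* names and the flag values are hypothetical stand-ins for XLR_CHECK_CONSISTENCY and BKPIMAGE_APPLY, whose real values live in the patched headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this illustration. */
#define TOY_CHECK_CONSISTENCY	0x02	/* stands in for XLR_CHECK_CONSISTENCY */
#define TOY_IMAGE_APPLY			0x04	/* stands in for BKPIMAGE_APPLY */

/* Hypothetical, reduced view of one decoded block reference. */
typedef struct ToyBlockImage
{
	bool		has_image;		/* a full-page image is stored in the record */
	uint8_t		bimg_info;		/* image flags */
} ToyBlockImage;

/*
 * Insert side: include an image if the page needs a backup or if the
 * record is marked for consistency checking. Only a genuinely needed
 * backup is flagged for restoring during redo.
 */
ToyBlockImage
toy_assemble_block(bool needs_backup, uint8_t info)
{
	ToyBlockImage blk = {false, 0};

	blk.has_image = needs_backup || (info & TOY_CHECK_CONSISTENCY) != 0;
	if (needs_backup)
		blk.bimg_info |= TOY_IMAGE_APPLY;
	return blk;
}

/*
 * Redo side: restore the image only when the apply flag is set. An image
 * without the flag is kept for the later comparison only.
 */
bool
toy_restore_image(ToyBlockImage blk)
{
	bool		apply = (blk.bimg_info & TOY_IMAGE_APPLY) != 0;

	/* The apply flag never appears without an image. */
	assert(blk.has_image || !apply);
	return apply;
}
```

With these definitions, a block that needs a backup is both stored and restored, while a block logged only for checking is stored but replayed through the normal redo routine.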
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index b528745..f077662 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -997,6 +997,7 @@ ResetDecoder(XLogReaderState *state)
 		state->blocks[block_id].in_use = false;
 		state->blocks[block_id].has_image = false;
 		state->blocks[block_id].has_data = false;
+		state->blocks[block_id].apply_image = false;
 	}
 	state->max_block_id = -1;
 }
@@ -1089,6 +1090,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			blk = &state->blocks[block_id];
 			blk->in_use = true;
+			blk->apply_image = false;
 
 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
@@ -1120,6 +1122,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
+
+				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+
 				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
 				{
 					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
@@ -1243,6 +1248,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (!blk->in_use)
 			continue;
+
+		Assert(blk->has_image || !blk->apply_image);
+
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 0de2419..6627f54 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -275,9 +275,9 @@ XLogCheckInvalidPages(void)
  * will complain if we don't have the lock.  In hot standby mode it's
  * definitely necessary.)
  *
- * Note: when a backup block is available in XLOG, we restore it
- * unconditionally, even if the page in the database appears newer.  This is
- * to protect ourselves against database pages that were partially or
+ * Note: when a backup block is available in XLOG with the BKPIMAGE_APPLY flag
+ * set, we restore it, even if the page in the database appears newer.  This
+ * is to protect ourselves against database pages that were partially or
  * incorrectly written during a crash.  We assume that the XLOG data must be
  * good because it has passed a CRC check, while the database page might not
  * be.  This will force us to replay all subsequent modifications of the page
@@ -352,9 +352,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
-	/* If it's a full-page image, restore it. */
-	if (XLogRecHasBlockImage(record, block_id))
+	/* If it has a full-page image and it should be restored, do it. */
+	if (XLogRecBlockImageApply(record, block_id))
 	{
+		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
 		   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index c148b09..29742c6 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -33,6 +33,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "storage/bufmask.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -1740,3 +1741,14 @@ ResetSequenceCaches(void)
 
 	last_used_seq = NULL;
 }
+
+/*
+ * Mask a Sequence page before performing consistency checks on it.
+ */
+void
+seq_mask(char *page, BlockNumber blkno)
+{
+	mask_page_lsn(page);
+
+	mask_unused_space(page);
+}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..8630dca 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c
new file mode 100644
index 0000000..68089b6
--- /dev/null
+++ b/src/backend/storage/buffer/bufmask.c
@@ -0,0 +1,125 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.c
+ *	  Routines for buffer masking. Used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * Contains common routines required for masking a page.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmask.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/bufmask.h"
+
+/*
+ * mask_page_lsn
+ *
+ * In consistency checks, the LSN of the two pages compared will likely be
+ * different because of concurrent operations when the WAL is generated
+ * and the state of the page when WAL is applied.
+ */
+void
+mask_page_lsn(Page page)
+{
+	PageHeader	phdr = (PageHeader) page;
+
+	PageXLogRecPtrSet(phdr->pd_lsn, (uint64) MASK_MARKER);
+}
+
+/*
+ * mask_page_hint_bits
+ *
+ * Mask hint bits in PageHeader. We want to ignore differences in hint bits,
+ * since they can be set without emitting any WAL.
+ */
+void
+mask_page_hint_bits(Page page)
+{
+	PageHeader	phdr = (PageHeader) page;
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = MASK_MARKER;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints. */
+	PageClearFull(page);
+	PageClearHasFreeLinePointers(page);
+
+	/*
+	 * During replay, if the page LSN has advanced past our XLOG record's LSN,
+	 * we don't mark the page all-visible. See heap_xlog_visible() for
+	 * details.
+	 */
+	PageClearAllVisible(page);
+}
+
+/*
+ * mask_unused_space
+ *
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+void
+mask_unused_space(Page page)
+{
+	int			pd_lower = ((PageHeader) page)->pd_lower;
+	int			pd_upper = ((PageHeader) page)->pd_upper;
+	int			pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d",
+			 pd_lower, pd_upper, pd_special);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * mask_lp_flags
+ *
+ * Line pointer flags can be modified in master without emitting any WAL record.
+ * Hence, we want to ignore differences in line pointer flags.
+ */
+void
+mask_lp_flags(Page page)
+{
+	OffsetNumber offnum,
+				maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+
+		if (ItemIdIsUsed(itemId))
+			itemId->lp_flags = LP_UNUSED;
+	}
+}
+
+/*
+ * mask_page_content
+ */
+void
+mask_page_content(Page page)
+{
+	/* Mask Page Content */
+	memset(page + SizeOfPageHeaderData, MASK_MARKER,
+		   BLCKSZ - SizeOfPageHeaderData);
+
+	/* Mask pd_lower and pd_upper */
+	memset(&((PageHeader) page)->pd_lower, MASK_MARKER,
+		   sizeof(uint16));
+	memset(&((PageHeader) page)->pd_upper, MASK_MARKER,
+		   sizeof(uint16));
+}
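Reviewer note on bufmask.c above: the masking routines exist so that a plain memcmp() of the replayed page against the logged image ignores bytes that may legitimately differ. The following is a minimal standalone sketch of that mask-then-compare idea, assuming a toy 64-byte page with an 8-byte LSN followed by two 16-bit offsets. The layout and all toy_* names are hypothetical and much simpler than the real PageHeaderData.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy layout: bytes 0-7 LSN, 8-9 lower, 10-11 upper. */
#define TOY_BLCKSZ		64
#define TOY_HDRSZ		12
#define TOY_MASK_MARKER	0

/* Read a 16-bit field without assuming the buffer is aligned. */
uint16_t
toy_get_u16(const char *page, size_t off)
{
	uint16_t	v;

	memcpy(&v, page + off, sizeof(v));
	return v;
}

/* Build an otherwise empty page with the given header fields. */
void
toy_init_page(char *page, uint64_t lsn, uint16_t lower, uint16_t upper)
{
	memset(page, 0, TOY_BLCKSZ);
	memcpy(page, &lsn, sizeof(lsn));
	memcpy(page + 8, &lower, sizeof(lower));
	memcpy(page + 10, &upper, sizeof(upper));
}

/* Overwrite the LSN and the unused hole between lower and upper. */
void
toy_mask(char *page)
{
	uint16_t	lower = toy_get_u16(page, 8);
	uint16_t	upper = toy_get_u16(page, 10);

	/* Same kind of sanity check as mask_unused_space(). */
	assert(lower >= TOY_HDRSZ && lower <= upper && upper <= TOY_BLCKSZ);

	memset(page, TOY_MASK_MARKER, 8);
	memset(page + lower, TOY_MASK_MARKER, upper - lower);
}

/* Mask private copies of both pages, then compare them byte for byte. */
bool
toy_pages_consistent(const char *replayed, const char *logged)
{
	char		a[TOY_BLCKSZ];
	char		b[TOY_BLCKSZ];

	memcpy(a, replayed, TOY_BLCKSZ);
	memcpy(b, logged, TOY_BLCKSZ);
	toy_mask(a);
	toy_mask(b);
	return memcmp(a, b, TOY_BLCKSZ) == 0;
}
```

Two pages that differ only in LSN or in the hole compare as consistent; a difference in the data area after upper is still reported.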
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c53aede..4c8216c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,9 +28,11 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -147,6 +149,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval,
 static bool check_log_destination(char **newval, void **extra, GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
+static bool check_wal_consistency_checking(char **newval, void **extra, GucSource source);
+static void assign_wal_consistency_checking(const char *newval, void *extra);
+
 #ifdef HAVE_SYSLOG
 static int	syslog_facility = LOG_LOCAL0;
 #else
@@ -3266,6 +3271,16 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"wal_consistency_checking", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("Sets the WAL resource managers for which WAL consistency checks are done."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&wal_consistency_checking_string,
+		"",
+		check_wal_consistency_checking, assign_wal_consistency_checking, NULL
+	},
+	{
 		{"log_destination", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Sets the destination for server log output."),
 			gettext_noop("Valid values are combinations of \"stderr\", "
@@ -9889,6 +9904,106 @@ call_enum_check_hook(struct config_enum * conf, int *newval, void **extra,
  */
 
 static bool
+check_wal_consistency_checking(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		newwalconsistency[RM_MAX_ID + 1];
+
+	/* Initialize the array */
+	MemSet(newwalconsistency, 0, (RM_MAX_ID + 1) * sizeof(bool));
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *tok = (char *) lfirst(l);
+		bool		found = false;
+		RmgrId		rmid;
+
+		/* Check for 'all'. */
+		if (pg_strcasecmp(tok, "all") == 0)
+		{
+			for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+			{
+				/*
+				 * Check if consistency checking is enabled for this resource
+				 * manager.
+				 */
+				if (consistencyCheck_is_enabled(rmid))
+					newwalconsistency[rmid] = true;
+			}
+			found = true;
+		}
+		else
+		{
+			/*
+			 * Check if the token matches with any individual resource
+			 * manager.
+			 */
+			for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+			{
+				if (pg_strcasecmp(tok, RmgrTable[rmid].rm_name) == 0)
+				{
+					/*
+					 * Found a match. Now, check if consistency checking is
+					 * enabled for this resource manager.
+					 */
+					if (consistencyCheck_is_enabled(rmid))
+					{
+						newwalconsistency[rmid] = true;
+						found = true;
+						break;
+					}
+					else
+					{
+						GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+						pfree(rawstring);
+						list_free(elemlist);
+						return false;
+					}
+				}
+			}
+		}
+
+		/* If a valid resource manager is found, check for the next one. */
+		if (found)
+			continue;
+
+		GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	/* assign new value */
+	*extra = guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool));
+	memcpy(*extra, newwalconsistency, (RM_MAX_ID + 1) * sizeof(bool));
+	return true;
+}
+
+static void
+assign_wal_consistency_checking(const char *newval, void *extra)
+{
+	wal_consistency_checking = (bool *) extra;
+}
+
+static bool
 check_log_destination(char **newval, void **extra, GucSource source)
 {
 	char	   *rawstring;
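Reviewer note on check_wal_consistency_checking() above: the hook turns a comma-separated list into one boolean per resource manager, accepts 'all', and rejects names that are unknown or have no masking support. The following is a minimal standalone sketch of that mapping. It uses strtok() as a stand-in for SplitIdentifierString(), and the four-entry table and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NUM_RMGRS 4

/* Hypothetical table: name, and whether consistency checks are supported. */
static const struct
{
	const char *name;
	bool		checkable;
}			toy_rmgrs[TOY_NUM_RMGRS] = {
	{"XLOG", false}, {"Heap", true}, {"Btree", true}, {"Hash", false}
};

/* Case-insensitive string equality, in the spirit of pg_strcasecmp(). */
static bool
toy_name_eq(const char *a, const char *b)
{
	while (*a && *b)
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;
}

/*
 * Parse a list such as "heap, btree" or "all" into enabled[]. Returns
 * false and leaves enabled[] untouched if a token is unknown or names a
 * resource manager without checking support.
 */
bool
toy_parse_list(const char *value, bool enabled[TOY_NUM_RMGRS])
{
	bool		result[TOY_NUM_RMGRS] = {false};
	char		buf[128];
	char	   *tok;

	assert(value != NULL && enabled != NULL);

	/* Work on a modifiable copy, as the real hook does. */
	if (strlen(value) >= sizeof(buf))
		return false;
	strcpy(buf, value);

	for (tok = strtok(buf, ", "); tok != NULL; tok = strtok(NULL, ", "))
	{
		bool		found = false;
		int			i;

		for (i = 0; i < TOY_NUM_RMGRS; i++)
		{
			if (!toy_rmgrs[i].checkable)
				continue;
			if (toy_name_eq(tok, "all") ||
				toy_name_eq(tok, toy_rmgrs[i].name))
			{
				result[i] = true;
				found = true;
			}
		}
		if (!found)
			return false;
	}

	memcpy(enabled, result, sizeof(result));
	return true;
}
```

An empty string parses successfully and enables nothing, which corresponds to the default value of the setting.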
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 661b0fa..843abf5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,10 @@
 					#   open_sync
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
+#wal_consistency_checking = ''		# Valid values are combinations of
+					# heap2, heap, btree, generic, gin, gist,
+					# sequence, spgist and brin. It can also
+					# be set to 'all' to enable all the values
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index cb43381..a7f6fe2 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index 590d2ad..fdc657b 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -465,7 +465,12 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					   rnode.spcNode, rnode.dbNode, rnode.relNode,
 					   blk);
 			if (XLogRecHasBlockImage(record, block_id))
-				printf(" FPW");
+			{
+				if (XLogRecBlockImageApply(record, block_id))
+					printf(" FPW");
+				else
+					printf(" FPW, for WAL verification");
+			}
 		}
 		putchar('\n');
 	}
@@ -489,7 +494,10 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				if (record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
-					printf(" (FPW); hole: offset: %u, length: %u, compression saved: %u\n",
+					printf(" (FPW%s); hole: offset: %u, length: %u, "
+						   "compression saved: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+						   "" : ", for WAL verification",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length,
 						   BLCKSZ -
@@ -498,7 +506,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				}
 				else
 				{
-					printf(" (FPW); hole: offset: %u, length: %u\n",
+					printf(" (FPW%s); hole: offset: %u, length: %u\n",
+						   XLogRecBlockImageApply(record, block_id) ?
+						   "" : ", for WAL verification",
 						   record->blocks[block_id].hole_offset,
 						   record->blocks[block_id].hole_length);
 				}
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..5d19a4a 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	{ name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index 527b2f1..8e06b56 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_mask(char *page, BlockNumber blkno);
 
 #endif   /* BRIN_XLOG_H */
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index 187d68b..3653ec4 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -40,5 +40,6 @@ extern void GenericXLogAbort(GenericXLogState *state);
 extern void generic_redo(XLogReaderState *record);
 extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
+extern void generic_mask(char *page, BlockNumber blkno);
 
 #endif   /* GENERIC_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 5629c8a..b72bfe8 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -79,5 +79,6 @@ extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
+extern void gin_mask(char *page, BlockNumber blkno);
 
 #endif   /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 60a770a..8801e34 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -459,6 +459,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_mask(char *page, BlockNumber blkno);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 52f28b8..07732eb 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -373,6 +373,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap_identify(uint8 info);
+extern void heap_mask(char *page, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..b9e1a76 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -774,5 +774,6 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_mask(char *page, BlockNumber blkno);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..10d94e5 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
 	symname,
 
 typedef enum RmgrIds
@@ -33,3 +33,5 @@ typedef enum RmgrIds
 #define RM_MAX_ID				(RM_NEXT_ID - 1)
 
 #endif   /* RMGR_H */
+
+extern bool consistencyCheck_is_enabled(RmgrId rmid);
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 5f76749..b892aea 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index aaf78bc..3b2a0a7 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -219,5 +219,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_mask(char *page, BlockNumber blkno);
 
 #endif   /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a425572..9f036c7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency_checking;
+extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 8ad4d47..97bbc4c 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -266,6 +266,10 @@ typedef enum
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
  * record, if available (e.g. the last block).
  *
+ * rm_mask uses in input a page associated to the resource manager's records
+ * and performs masking actions on it for consistency check comparisons.
+ * The input must be an already allocated page of size BLCKSZ.
+ *
  * RmgrTable[] is indexed by RmgrId values (see rmgrlist.h).
  */
 typedef struct RmgrData
@@ -276,6 +280,7 @@ typedef struct RmgrData
 	const char *(*rm_identify) (uint8 info);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
+	void		(*rm_mask) (char *page, BlockNumber blkno);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 00102e8..20aa375 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,7 @@ typedef struct
 
 	/* Information on full-page image, if any */
 	bool		has_image;
+	bool		apply_image;	/* Restore image during WAL replay */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -205,6 +206,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 	((decoder)->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
 	((decoder)->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id) \
+	((decoder)->blocks[block_id].apply_image)
 
 extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst);
 extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0162f93..b9aec21 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -56,8 +56,8 @@ typedef struct XLogRecord
 
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
- * are set internally by XLogInsert.
+ * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
+ * XLogInsert caller. The rest are set internally by XLogInsert.
  */
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
@@ -71,6 +71,15 @@ typedef struct XLogRecord
 #define XLR_SPECIAL_REL_UPDATE	0x01
 
 /*
+ * Enforces consistency checks of replayed WAL at recovery. If enabled,
+ * each record will log a full-page write for each block modified by the
+ * record and will reuse it afterwards for consistency checks. The caller
+ * of XLogInsert can use this value if necessary, note that if
+ * wal_consistency_checking is enabled for a rmgr this is set unconditionally.
+ */
+#define XLR_CHECK_CONSISTENCY	0x02
+
+/*
  * Header info for block data appended to an XLOG record.
  *
  * 'data_length' is the length of the rmgr-specific payload data associated
@@ -137,6 +146,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */
+#define BKPIMAGE_APPLY		0x04	/* page image should be restored during replay */
 
 /*
  * Extra header information used when page image has "hole" and
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 144c3c2..efc4a51 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -62,5 +62,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif   /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..fdf5a19
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *	  Definitions for buffer masking routines, used to mask certain bits
+ *	  in a page which can be different when the WAL is generated
+ *	  and when the WAL is applied. So, we mask those bits before any
+ *	  page comparison to make them consistent.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+#include "storage/bufmgr.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER		0
+
+extern void mask_page_lsn(Page page);
+extern void mask_page_hint_bits(Page page);
+extern void mask_unused_space(Page page);
+extern void mask_lp_flags(Page page);
+extern void mask_page_content(Page page);
+
+#endif
-- 
1.8.3.1
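
The masking idea described in the bufmask.h header comment above can be sketched in standalone C. This is only an illustration under assumed names and an assumed toy page layout (an 8-byte LSN followed by 2-byte lower and upper offsets); it does not use PostgreSQL's real page format or the mask_* functions declared in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy layout, not PostgreSQL's PageHeaderData. */
#define TOY_PAGE_SIZE   128
#define TOY_HEADER_SIZE 12   /* 8-byte LSN + 2-byte lower + 2-byte upper */
#define TOY_MASK_MARKER 0

/* Write the toy header fields at fixed offsets (avoids struct padding). */
void toy_set_header(char *page, uint64_t lsn, uint16_t lower, uint16_t upper)
{
    memcpy(page, &lsn, sizeof lsn);
    memcpy(page + 8, &lower, sizeof lower);
    memcpy(page + 10, &upper, sizeof upper);
}

/* The LSN can legitimately differ between the two copies, so blank it. */
void toy_mask_page_lsn(char *page)
{
    memset(page, TOY_MASK_MARKER, sizeof(uint64_t));
}

/* Bytes between lower and upper are free space whose content is undefined. */
void toy_mask_unused_space(char *page)
{
    uint16_t lower, upper;

    memcpy(&lower, page + 8, sizeof lower);
    memcpy(&upper, page + 10, sizeof upper);
    assert(lower >= TOY_HEADER_SIZE && lower <= upper && upper <= TOY_PAGE_SIZE);
    memset(page + lower, TOY_MASK_MARKER, (size_t) (upper - lower));
}

/* Mask private copies of both pages, then compare them byte for byte. */
bool toy_pages_match(const char *replayed, const char *image)
{
    char a[TOY_PAGE_SIZE], b[TOY_PAGE_SIZE];

    memcpy(a, replayed, TOY_PAGE_SIZE);
    memcpy(b, image, TOY_PAGE_SIZE);
    toy_mask_page_lsn(a);
    toy_mask_unused_space(a);
    toy_mask_page_lsn(b);
    toy_mask_unused_space(b);
    return memcmp(a, b, TOY_PAGE_SIZE) == 0;
}
```

Two pages that differ only in their LSN or in free-space junk compare equal after masking, while a difference in the data area is still reported. Masking copies rather than the originals keeps the live buffer untouched.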

#116Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#115)
Re: WAL consistency check facility

On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Thank you Robert for the review. Please find the updated patch in the
attachment.

I have committed this patch after fairly extensive revisions:

* Rewrote the documentation to give some idea what the underlying
mechanism of operation of the feature is, so that users who choose to
enable this will hopefully have some understanding of what they've
turned on.
* Renamed "char *page" arguments to "char *pagedata" and "Page page",
because tempPage doesn't seem to me to be any better a name than
page_norm.
* Moved bufmask.c to src/backend/access/common, because there's no
code in src/backend/storage/buffer that knows anything about the
format of pages; that is the job of AMs, hence src/backend/access.
* Improved some comments in bufmask.c
* Removed consistencyCheck_is_enabled in favor of determining which
RMs support masking by the presence or absence of an rm_mask function.
* Removed assertion in checkXLogConsistency that consistency checking
is enabled for the RM; that's trivially false if
wal_consistency_checking is not the same on the master and the
standby. (Note that quite apart from the issue of whether this
function should exist at all, adding it to a header file after the
closing #endif guard is certainly not right.)
* Changed checkXLogConsistency to use RBM_NORMAL_NO_LOG instead of
RBM_NORMAL. I'm not sure if there are any cases where this makes a
difference, but it seems safer.
* Changed checkXLogConsistency to skip pages whose LSN is newer than
that of the record. Without this, if you shut down recovery and
restart it, it complains of inconsistent pages and dies. (I'm not
sure this is the only scenario that needs to be covered; it would be
smart to do more testing of restarting the standby.)
* Made wal_consistency_checking a developer option instead of a WAL
option. Even though it CAN be used in production, we don't
particularly want to encourage that; enabling WAL consistency checking
has a big performance cost and makes your system more fragile not less
-- a WAL consistency failure causes your standby to die a hard death.
(Maybe there should be a way to suppress consistency checking on the
standby -- but I think not just by requiring wal_consistency_checking
on both ends. Or maybe we should just downgrade the FATAL to WARNING;
blowing up the standby irrevocably seems like poor behavior.)
* Coding style improvement in check_wal_consistency_checking.
* Removed commas in messages added to pg_xlogdump; those didn't look
good to me, on further review.
* Comment improvements in xlog_internal.h and xlogreader.h
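
Two of the points above (detecting masking support by whether rm_mask is set, and skipping pages whose LSN is newer than the record) can be illustrated with a minimal standalone sketch. It is not the committed code: the Toy* names, the 64-byte page with the LSN in its first 8 bytes, and the record_end_lsn parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 64            /* hypothetical page size */

typedef void (*toy_mask_fn) (char *pagedata);

/* Hypothetical stand-in for a resource manager table entry. */
typedef struct ToyRmgr
{
    const char *name;
    toy_mask_fn rm_mask;         /* NULL means "no masking support" */
} ToyRmgr;

typedef enum { TOY_SKIPPED, TOY_CONSISTENT, TOY_INCONSISTENT } ToyResult;

/* In this toy layout the page LSN lives in the first 8 bytes. */
void toy_set_lsn(char *page, uint64_t lsn)
{
    memcpy(page, &lsn, sizeof lsn);
}

uint64_t toy_get_lsn(const char *page)
{
    uint64_t lsn;

    memcpy(&lsn, page, sizeof lsn);
    return lsn;
}

/* A trivial mask routine that only blanks the LSN. */
void toy_mask_lsn(char *pagedata)
{
    memset(pagedata, 0, sizeof(uint64_t));
}

ToyResult toy_check_consistency(const ToyRmgr *rmgr, uint64_t record_end_lsn,
                                const char *replayed, const char *image)
{
    char a[TOY_BLCKSZ], b[TOY_BLCKSZ];

    assert(rmgr != NULL);

    /* No mask function: this resource manager cannot be checked. */
    if (rmgr->rm_mask == NULL)
        return TOY_SKIPPED;

    /* Page already reflects later WAL (e.g. after a restart): skip it. */
    if (toy_get_lsn(replayed) > record_end_lsn)
        return TOY_SKIPPED;

    memcpy(a, replayed, TOY_BLCKSZ);
    memcpy(b, image, TOY_BLCKSZ);
    rmgr->rm_mask(a);
    rmgr->rm_mask(b);
    return memcmp(a, b, TOY_BLCKSZ) == 0 ? TOY_CONSISTENT : TOY_INCONSISTENT;
}
```

Using the function pointer itself as the capability flag means there is no separate "is enabled" lookup to keep in sync with the table, and the check never depends on the setting having the same value on both servers.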

I also bumped XLOG_PAGE_MAGIC (which is normally done by the
committer, not the patch author, so I wasn't expecting that to be in
the patch as submitted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#117Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#116)
Re: WAL consistency check facility

On Thu, Feb 9, 2017 at 2:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Thank you Robert for the review. Please find the updated patch in the
attachment.

I have committed this patch after fairly extensive revisions:

Thank you, Robert, for the above corrections and the commit. Thanks to
Michael Paquier, Peter Eisentraut, Amit Kapila, Álvaro Herrera, and
Simon Riggs for taking the time to help complete the patch. It was a
great learning experience for me.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#118Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#116)
1 attachment(s)
Re: WAL consistency check facility

On Thu, Feb 9, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Thank you Robert for the review. Please find the updated patch in the
attachment.

I have committed this patch after fairly extensive revisions:

Cool. I finally had a look at what has been committed in a507b869.
Running regression tests with all RMGRs enabled, a single installcheck
generates 7GB of WAL. Woah.

* Rewrote the documentation to give some idea what the underlying
mechanism of operation of the feature is, so that users who choose to
enable this will hopefully have some understanding of what they've
turned on.

Thanks, those look good to me.

* Renamed "char *page" arguments to "char *pagedata" and "Page page",
because tempPage doesn't seem to me to be any better a name than
page_norm.

* Removed assertion in checkXLogConsistency that consistency checking
is enabled for the RM; that's trivially false if
wal_consistency_checking is not the same on the master and the
standby. (Note that quite apart from the issue of whether this
function should exist at all, adding it to a header file after the
closing #endif guard is certainly not right.)

I recall doing those two things the same way as in the commit. Not
sure at which point they have been re-introduced.

* Changed checkXLogConsistency to skip pages whose LSN is newer than
that of the record. Without this, if you shut down recovery and
restart it, it complains of inconsistent pages and dies. (I'm not
sure this is the only scenario that needs to be covered; it would be
smart to do more testing of restarting the standby.)

Good point.

-- a WAL consistency failure causes your standby to die a hard death.
(Maybe there should be a way to suppress consistency checking on the
standby -- but I think not just by requiring wal_consistency_checking
on both ends. Or maybe we should just downgrade the FATAL to WARNING;
blowing up the standby irrevocably seems like poor behavior.)

Having a FATAL is useful for buildfarm members, as that would show up in
red. I agree that having a switch to generate a warning instead would be
useful for live deployments. I also think that we need three things:
- A recovery test to run regression tests with a standby behind.
- Extend the TAP tests so that it is possible to fill in postgresql.conf
with custom variables.
- Have the buildfarm client run the recovery tests!
I am happy to write those patches.

I also bumped XLOG_PAGE_MAGIC (which is normally done by the
committer, not the patch author, so I wasn't expecting that to be in
the patch as submitted).

Here are a couple of things I have noticed while looking at the code.

+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
s/2016/2017/ in bufmask.c and bufmask.h.

+       if (ItemIdIsNormal(iid))
+       {
+
+           HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
Unnecessary newline here.

+        * Read the contents from the backup copy, stored in WAL record and
+        * store it in a temporary page. There is not need to allocate a new
+        * page here, a local buffer is fine to hold its contents and a mask
+        * can be directly applied on it.
s/not need/no need/.

In checkXLogConsistency(), FPWs that have the flag BKPIMAGE_APPLY set
will still be checked, resulting in a FPW being compared to itself. I
think that those had better be bypassed.

Please find attached a patch with those fixes.
--
Michael

Attachments:

consistency-checks-fix.patch (application/octet-stream)
diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
index 3b06115e03..b579bb8db4 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -5,7 +5,7 @@
  *	  in a page which can be different when the WAL is generated
  *	  and when the WAL is applied.
  *
- * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
  *
  * Contains common routines required for masking a page.
  *
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0be48fb3ee..af258366a2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9167,7 +9167,6 @@ heap_mask(char *pagedata, BlockNumber blkno)
 
 		if (ItemIdIsNormal(iid))
 		{
-
 			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
 
 			/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2dcff7f54b..f23e108628 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1360,6 +1360,16 @@ checkXLogConsistency(XLogReaderState *record)
 
 		Assert(XLogRecHasBlockImage(record, block_id));
 
+		if (XLogRecBlockImageApply(record, block_id))
+		{
+			/*
+			 * WAL record has already applied the page, so bypass the
+			 * consistency check as that would result in comparing the full
+			 * page stored in the record with itself.
+			 */
+			continue;
+		}
+
 		/*
 		 * Read the contents from the current buffer and store it in a
 		 * temporary page.
@@ -1390,7 +1400,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 		/*
 		 * Read the contents from the backup copy, stored in WAL record and
-		 * store it in a temporary page. There is not need to allocate a new
+		 * store it in a temporary page. There is no need to allocate a new
 		 * page here, a local buffer is fine to hold its contents and a mask
 		 * can be directly applied on it.
 		 */
diff --git a/src/include/access/bufmask.h b/src/include/access/bufmask.h
index add2dc0cd1..da6542d357 100644
--- a/src/include/access/bufmask.h
+++ b/src/include/access/bufmask.h
@@ -7,7 +7,7 @@
  *	  individual rmgr, but we make things easier by providing some
  *	  common routines to handle cases which occur in multiple rmgrs.
  *
- * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
  *
  * src/include/access/bufmask.h
  *
#119Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#118)
Re: WAL consistency check facility

On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Please find attached a patch with those fixes.

Committed, but I changed the copyright dates to 2016-2017 rather than
just 2017 since surely some of the code was originally written before
2017. Even that might not really be going back far enough, but it
doesn't matter too much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#120Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#119)
Re: WAL consistency check facility

On Wed, Feb 15, 2017 at 2:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Please find attached a patch with those fixes.

Committed, but I changed the copyright dates to 2016-2017 rather than
just 2017 since surely some of the code was originally written before
2017. Even that might not really be going back far enough, but it
doesn't matter too much.

Just out of curiosity: is it the moment the code was written or the
moment it was committed that counts? It's no big deal seeing how liberal
the Postgres license is, but this makes me wonder...
--
Michael


#121Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#120)
Re: WAL consistency check facility

On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Feb 15, 2017 at 2:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Please find attached a patch with those fixes.

Committed, but I changed the copyright dates to 2016-2017 rather than
just 2017 since surely some of the code was originally written before
2017. Even that might not really be going back far enough, but it
doesn't matter too much.

Just out of curiosity: is it the moment the code was written or the
moment it was committed that counts? It's no big deal seeing how liberal
the Postgres license is, but this makes me wonder...

IANAL, but I think if you ask one, he or she will tell you that what
matters is the date the work was created. In the case of code, that
means when the code was written.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#122Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#121)
Re: WAL consistency check facility

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Just out of curiosity: is it the moment the code was written or the
moment it was committed that counts? It's no big deal seeing how liberal
the Postgres license is, but this makes me wonder...

IANAL, but I think if you ask one, he or she will tell you that what
matters is the date the work was created. In the case of code, that
means when the code was written.

FWIW, my own habit when creating new PG files is generally to write

* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California

even if it's "all new" code. The main reason being that it's hardly ever
the case that you didn't copy-and-paste some amount of stuff out of a
pre-existing file, and trying to sort out how much of what originated
exactly when is an unrewarding exercise. Even if it is basically all
new code, this feels like giving an appropriate amount of credit to
Those Who Went Before Us.

regards, tom lane


#123Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#122)
Re: WAL consistency check facility

On Tue, Feb 14, 2017 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Just out of curiosity: is it the moment the code was written or the
moment it was committed that counts? It's no big deal seeing how liberal
the Postgres license is, but this makes me wonder...

IANAL, but I think if you ask one, he or she will tell you that what
matters is the date the work was created. In the case of code, that
means when the code was written.

FWIW, my own habit when creating new PG files is generally to write

* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California

even if it's "all new" code. The main reason being that it's hardly ever
the case that you didn't copy-and-paste some amount of stuff out of a
pre-existing file, and trying to sort out how much of what originated
exactly when is an unrewarding exercise. Even if it is basically all
new code, this feels like giving an appropriate amount of credit to
Those Who Went Before Us.

Right. I tend to do the same, and wonder if we shouldn't make that a
general practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#124Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#123)
Re: WAL consistency check facility

On Wed, Feb 15, 2017 at 11:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 14, 2017 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, my own habit when creating new PG files is generally to write

* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California

even if it's "all new" code. The main reason being that it's hardly ever
the case that you didn't copy-and-paste some amount of stuff out of a
pre-existing file, and trying to sort out how much of what originated
exactly when is an unrewarding exercise. Even if it is basically all
new code, this feels like giving an appropriate amount of credit to
Those Who Went Before Us.

Right. I tend to do the same, and wonder if we shouldn't make that a
general practice.

This looks sensible to me. No-brainer rules that make sense mean fewer
things to worry about.
--
Michael
