WAL replay bugs

Started by Heikki Linnakangas almost 12 years ago, 47 messages

hlinnakangas@vmware.com

almost 12 years ago

I've been playing with a little hack that records a before and after
image of every page modification that is WAL-logged, and writes the
images to a file along with the LSN of the corresponding WAL record. I
set up a master-standby replication with that hack in place in both
servers, and ran the regression suite. Then I compared the after images
after every WAL record, as written on master, and as replayed by the
standby.

The idea is that the page content in the standby after replaying a WAL
record should be identical to the page in the master, when the WAL
record was generated. There are some known cases where that doesn't
hold, but it's a useful sanity check. To reduce noise, I've been
focusing on one access method at a time, filtering out others.
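The check described above amounts to comparing two page-sized buffers byte for byte. A minimal sketch, assuming the default 8 kB block size; `first_page_difference` and `PAGE_SIZE` are illustrative names and not part of the hack itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192			/* assumption: default BLCKSZ */

/*
 * Return the offset of the first byte that differs between the page image
 * logged on the master and the one logged on the standby, or -1 if the two
 * images are identical.
 */
static long
first_page_difference(const uint8_t *master, const uint8_t *standby)
{
	size_t		i;

	assert(master != NULL && standby != NULL);

	for (i = 0; i < PAGE_SIZE; i++)
	{
		if (master[i] != standby[i])
			return (long) i;
	}
	return -1;
}
```

A result of -1 for every WAL record would correspond to a clean run; any other value points at the first byte worth investigating.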

I did that for GIN first, and indeed found a bug in my new
incomplete-split code, see commit 594bac42. After fixing that, and
zeroing some padding bytes (38a2b95c), I'm now getting a clean run with
that.

Next, I took on GiST, and lo and behold found a bug there pretty quickly
as well. This one has been there ever since we got Hot Standby: the redo
of a page update (e.g. an insertion) resets the right-link of the page.
If there is a concurrent scan, in a hot standby server, that scan might
still need the rightlink, and will hence miss some tuples. This can be
reproduced like this:

1. in master, create test table.

CREATE TABLE gisttest (id int4);
CREATE INDEX gisttest_idx ON gisttest USING gist (id);
INSERT INTO gisttest SELECT g * 1000 from generate_series(1, 100000) g;

-- Test function. Starts a scan, fetches one row from it, then waits 10
-- seconds until fetching the rest of the rows.
-- Returns the number of rows scanned. Should be 100000 if you follow
-- these test instructions.
CREATE OR REPLACE FUNCTION gisttestfunc() RETURNS int AS
$$
declare
i int4;
t text;
cur CURSOR FOR SELECT 'foo' FROM gisttest WHERE id >= 0;
begin
set enable_seqscan=off; set enable_bitmapscan=off;

i = 0;
OPEN cur;
FETCH cur INTO t;

perform pg_sleep(10);

LOOP
EXIT WHEN NOT FOUND; -- this is bogus on first iteration
i = i + 1;
FETCH cur INTO t;
END LOOP;
CLOSE cur;
RETURN i;
END;
$$ LANGUAGE plpgsql;

2. in standby

SELECT gisttestfunc();
<blocks>

3. Quickly, before the scan in standby continues, cause some page splits:

INSERT INTO gisttest SELECT g * 1000+1 from generate_series(1, 100000) g;

4. The scan in standby finishes. It should return 100000, but will
return a lower number if you hit the bug.

At a quick glance, I think fixing that is just a matter of not resetting
the right-link. I'll take a closer look tomorrow, but for now I just
wanted to report what I've been doing. I'll post the scripts I've been
using later too - nag me if I don't.
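To see why a reset right-link makes a scan miss tuples, here is a deliberately simplified toy model. It assumes a scan that started on a page before it was split and must chase the right-link to reach the tuples that moved to the new sibling. `ToyGistPage`, `toy_scan` and `NO_RIGHTLINK` are hypothetical names, and the model ignores the NSN/LSN check a real GiST scan performs before following the link:

```c
#include <assert.h>
#include <stddef.h>

#define NO_RIGHTLINK (-1)

/* Toy stand-in for a GiST page: only a tuple count and a right-link. */
typedef struct
{
	int			ntuples;
	int			rightlink;	/* index of right sibling, or NO_RIGHTLINK */
} ToyGistPage;

/*
 * Count the tuples a scan sees when it starts at page 'start' and follows
 * right-links until there are none left.
 */
static int
toy_scan(const ToyGistPage *pages, size_t npages, int start)
{
	int			seen = 0;
	int			cur = start;

	while (cur != NO_RIGHTLINK)
	{
		assert(cur >= 0 && (size_t) cur < npages);
		seen += pages[cur].ntuples;
		cur = pages[cur].rightlink;
	}
	return seen;
}
```

With page 0 linked to page 1, the scan sees the tuples on both pages. If replay of a later update on page 0 sets its right-link back to `NO_RIGHTLINK`, the same scan stops after page 0 and silently returns fewer rows.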

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Josh Berkus

josh@agliodbs.com

almost 12 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL replay bugs

On 04/07/2014 02:16 PM, Heikki Linnakangas wrote:

> I've been playing with a little hack that records a before and after
> image of every page modification that is WAL-logged, and writes the
> images to a file along with the LSN of the corresponding WAL record. I
> set up a master-standby replication with that hack in place in both
> servers, and ran the regression suite. Then I compared the after images
> after every WAL record, as written on master, and as replayed by the
> standby.

This is awesome ... thank you for doing this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Michael Paquier

michael.paquier@gmail.com

almost 12 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL replay bugs

On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

> I've been playing with a little hack that records a before and after image
> of every page modification that is WAL-logged, and writes the images to a
> file along with the LSN of the corresponding WAL record. I set up a
> master-standby replication with that hack in place in both servers, and ran
> the regression suite. Then I compared the after images after every WAL
> record, as written on master, and as replayed by the standby.

Assuming that adding some dedicated hooks in the core able to do
actions before and after a page modification occurs is not *that*
costly (well I imagine that it is not acceptable in terms of
performance), could it be possible to get that in the shape of an
extension that could be used to test WAL record consistency? This may
be an idea to think about...
--
Michael


sachin kotwal

kotsachin@gmail.com

almost 12 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL replay bugs

I executed the given steps many times to reproduce this bug,
but I am still unable to hit it.
I used the attached script to try to reproduce it.

Can I get the scripts to reproduce this bug?

wal_replay_bug.sh
<http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh>

-----
Thanks and Regards,

Sachin Kotwal
NTT-DATA-OSS Center (Pune)

Heikki Linnakangas

hlinnakangas@vmware.com

almost 12 years ago

In reply to: sachin kotwal (#4)

1 attachment(s)

Re: WAL replay bugs

On 04/10/2014 10:52 AM, sachin kotwal wrote:

> I executed the given steps many times to reproduce this bug,
> but I am still unable to hit it.
> I used the attached script to try to reproduce it.
>
> Can I get the scripts to reproduce this bug?
>
> wal_replay_bug.sh
> <http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh>

Oh, I can't reproduce it using that script either. I must've used some
variation of it, and posted the wrong script.

The attached seems to do the trick. I changed the INSERT statements
slightly, so that all the new rows have the same key.

Thanks for verifying this!

- Heikki

Sachin D. Kotwal

kotsachin@gmail.com

almost 12 years ago

In reply to: Heikki Linnakangas (#5)

Re: WAL replay bugs

On Thu, Apr 10, 2014 at 6:21 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> On 04/10/2014 10:52 AM, sachin kotwal wrote:
>
>> I executed the given steps many times to reproduce this bug,
>> but I am still unable to hit it.
>> I used the attached script to try to reproduce it.
>>
>> Can I get the scripts to reproduce this bug?
>
> Oh, I can't reproduce it using that script either. I must've used some
> variation of it, and posted the wrong script.
>
> The attached seems to do the trick. I changed the INSERT statements
> slightly, so that all the new rows have the same key.
>
> Thanks for verifying this!

Thanks for explaining how to reproduce this bug.
I am able to reproduce it using the latest scripts from your last mail.
I applied the patch submitted for this bug and re-ran the scripts.
Now they give the correct result.

Thanks and Regards,

Sachin Kotwal

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Michael Paquier (#3)

Re: WAL replay bugs

On 04/08/2014 06:41 AM, Michael Paquier wrote:

> On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>
>> I've been playing with a little hack that records a before and after image
>> of every page modification that is WAL-logged, and writes the images to a
>> file along with the LSN of the corresponding WAL record. I set up a
>> master-standby replication with that hack in place in both servers, and ran
>> the regression suite. Then I compared the after images after every WAL
>> record, as written on master, and as replayed by the standby.
>
> Assuming that adding some dedicated hooks in the core able to do
> actions before and after a page modification occurs is not *that*
> costly (well I imagine that it is not acceptable in terms of
> performance), could it be possible to get that in the shape of an
> extension that could be used to test WAL record consistency? This may
> be an idea to think about...

Yeah, working on it. It can live as a patch set if nothing else.

This has been very fruitful, I just committed another fix for a bug I
found with this earlier today.

There are quite a few things that cause differences between master and
standby. We have hint bits in many places, unused space that isn't
zeroed etc.
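One way to tolerate such expected differences is to build a per-page mask of "don't care" bits and compare only what is left. A rough sketch under that assumption; `mask_range` and `pages_equal_masked` are hypothetical helpers, and which ranges or bits to mask depends on the page type, so the caller supplies them here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192			/* assumption: default BLCKSZ */

/* Mark every bit of the bytes in [lower, upper) as "don't care". */
static void
mask_range(uint8_t *mask, size_t lower, size_t upper)
{
	assert(lower <= upper && upper <= PAGE_SIZE);
	memset(mask + lower, 0xFF, upper - lower);
}

/* Compare two page images, ignoring every bit that is set in 'mask'. */
static bool
pages_equal_masked(const uint8_t *a, const uint8_t *b, const uint8_t *mask)
{
	size_t		i;

	for (i = 0; i < PAGE_SIZE; i++)
	{
		if ((a[i] & ~mask[i]) != (b[i] & ~mask[i]))
			return false;
	}
	return true;
}
```

Unused space, such as the hole between the line pointers and the tuple data, would be covered with `mask_range`, while a single hint bit can be ignored by setting just that bit in the corresponding mask byte.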

Two things that are not bugs, but I'd like to change just to make this
tool easier to maintain, and to generally clean things up:

1. When creating a sequence, we first use simple_heap_insert() to insert
the sequence tuple, which creates a WAL record. Then we write a new
sequence RM WAL record about the same thing. The reason is that the WAL
record written by regular heap_insert is bogus for a sequence tuple.
After replaying just the heap insertion, but not the other record, the
page doesn't have the magic value indicating that it's a sequence, i.e.
it's broken as a sequence page. That's OK because we only do this when
creating a new sequence, so if we crash between those two records, the
whole relation is not visible to anyone. Nevertheless, I'd like to fix
that by using PageAddItem directly to insert the tuple, instead of
simple_heap_insert. We have to override the xmin field of the tuple
anyway, and we don't need any of the other services like finding the
insert location, toasting, visibility map or freespace map updates, that
simple_heap_insert() provides.

2. _bt_restore_page, when restoring a B-tree page split record. It adds
tuples to the page in reverse order compared to how it's done in master.
There is a comment noting that, and it asks "Is it worth changing just
on general principles?". Yes, I think it is.
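The reason insertion order matters for a byte-level comparison can be shown with a toy page model. This is only an illustration: it assumes, as on a real heap or index page, that tuple space is carved from the end of the page downward, and `ToyPage`, `toy_init` and `toy_add_item` are made-up names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_MAX_ITEMS 8

typedef struct
{
	uint16_t	upper;						/* start of used tuple space */
	uint16_t	item_off[TOY_MAX_ITEMS];	/* line pointers: slot -> offset */
	uint8_t		data[TOY_PAGE_SIZE];
} ToyPage;

static void
toy_init(ToyPage *p)
{
	memset(p, 0, sizeof(*p));
	p->upper = TOY_PAGE_SIZE;
}

/* Store an item in the given slot; its bytes go just below 'upper'. */
static void
toy_add_item(ToyPage *p, int slot, const void *item, uint16_t len)
{
	assert(slot >= 0 && slot < TOY_MAX_ITEMS && len <= p->upper);
	p->upper -= len;
	memcpy(p->data + p->upper, item, len);
	p->item_off[slot] = p->upper;
}
```

Adding a 4-byte item to slot 0 and then a 2-byte item to slot 1 leaves them at offsets 60 and 58. Adding the same items in reverse order leaves them at 58 and 62. Both pages are logically equivalent, because every slot points at the same bytes, but the page images differ, which is exactly what a byte-for-byte comparison tool trips over.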

Any objections to changing those two?

- Heikki


Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Heikki Linnakangas (#7)

Re: WAL replay bugs

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

> Two things that are not bugs, but I'd like to change just to make this
> tool easier to maintain, and to generally clean things up:
>
> 1. When creating a sequence, we first use simple_heap_insert() to insert
> the sequence tuple, which creates a WAL record. Then we write a new
> sequence RM WAL record about the same thing. The reason is that the WAL
> record written by regular heap_insert is bogus for a sequence tuple.
> After replaying just the heap insertion, but not the other record, the
> page doesn't have the magic value indicating that it's a sequence, i.e.
> it's broken as a sequence page. That's OK because we only do this when
> creating a new sequence, so if we crash between those two records, the
> whole relation is not visible to anyone. Nevertheless, I'd like to fix
> that by using PageAddItem directly to insert the tuple, instead of
> simple_heap_insert. We have to override the xmin field of the tuple
> anyway, and we don't need any of the other services like finding the
> insert location, toasting, visibility map or freespace map updates, that
> simple_heap_insert() provides.
>
> 2. _bt_restore_page, when restoring a B-tree page split record. It adds
> tuples to the page in reverse order compared to how it's done in master.
> There is a comment noting that, and it asks "Is it worth changing just
> on general principles?". Yes, I think it is.
>
> Any objections to changing those two?

Not here. I've always suspected #2 was going to bite us someday anyway.

regards, tom lane


Peter Geoghegan

pg@heroku.com

over 11 years ago

In reply to: Tom Lane (#8)

Re: WAL replay bugs

On Thu, Apr 17, 2014 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Any objections to changing those two?
>
> Not here. I've always suspected #2 was going to bite us someday anyway.

--
Peter Geoghegan


#10

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Heikki Linnakangas (#7)

1 attachment(s)

Re: WAL replay bugs

On 04/17/2014 07:59 PM, Heikki Linnakangas wrote:

> On 04/08/2014 06:41 AM, Michael Paquier wrote:
>
>> On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>
>>> I've been playing with a little hack that records a before and after image
>>> of every page modification that is WAL-logged, and writes the images to a
>>> file along with the LSN of the corresponding WAL record. I set up a
>>> master-standby replication with that hack in place in both servers, and ran
>>> the regression suite. Then I compared the after images after every WAL
>>> record, as written on master, and as replayed by the standby.
>>
>> Assuming that adding some dedicated hooks in the core able to do
>> actions before and after a page modification occurs is not *that*
>> costly (well I imagine that it is not acceptable in terms of
>> performance), could it be possible to get that in the shape of an
>> extension that could be used to test WAL record consistency? This may
>> be an idea to think about...
>
> Yeah, working on it. It can live as a patch set if nothing else.
>
> This has been very fruitful, I just committed another fix for a bug I
> found with this earlier today.
>
> There are quite a few things that cause differences between master and
> standby. We have hint bits in many places, unused space that isn't
> zeroed etc.

[a few more fixed bugs later]

Ok, I'm now getting clean output when running the regression suite with
this tool.

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master
and standby.

The attached diff contains both parts. The postprocessing tool is in
contrib/page_image_logging. See contrib/page_image_logging/README for
instructions. Let me know if you have any questions or need further help
running the tool.
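To line up the two image logs, the post-processing step has to turn the LSN tag on each logged line into something it can compare. A small sketch of that step, assuming the "LSN: %X/%08X" text form that postprocess-images.c in the attached patch scans for; `parse_lsn` is a hypothetical helper that, unlike the sscanf call in the patch, checks for a parse failure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the "LSN: hi/lo" prefix of a logged line into a 64-bit value.
 * Returns false if the line does not start with a well-formed LSN tag.
 */
static bool
parse_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	assert(line != NULL && lsn != NULL);

	if (sscanf(line, "LSN: %X/%X", &hi, &lo) != 2)
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}
```

Comparing the resulting integers avoids relying on the textual order of the lines, which can differ from numeric order when the high half of the LSN is not zero-padded.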

I've also pushed this to my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
"page_image_logging". I intend to keep it up-to-date with current master.

This is a pretty ugly hack, so I'm not proposing to commit this in the
current state. But perhaps this could be done more cleanly, by adding
some hooks in the backend as Michael suggested.

- Heikki

Attachments:

page_image_logging-1.patch (text/x-diff)

diff --git a/contrib/page_image_logging/Makefile b/contrib/page_image_logging/Makefile
new file mode 100644
index 0000000..9c68bbc
--- /dev/null
+++ b/contrib/page_image_logging/Makefile
@@ -0,0 +1,20 @@
+# contrib/page_image_logging/Makefile
+
+PGFILEDESC = "postprocess-images - "
+
+PROGRAM = postprocess-images
+OBJS	= postprocess-images.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS = $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/page_image_logging
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/page_image_logging/README b/contrib/page_image_logging/README
new file mode 100644
index 0000000..2c3d271
--- /dev/null
+++ b/contrib/page_image_logging/README
@@ -0,0 +1,50 @@
+Usage
+-----
+
+1. Apply the patch
+
+2. Set up a master and standby.
+
+3. Stop master, then standby.
+
+4. Remove $PGDATA/buffer-images from both servers.
+
+5. Start master and standby
+
+6. Run "make installcheck", or whatever you want to test
+
+7. Stop master, then standby
+
+8. Compare the logged page images using the postprocessing tool:
+
+./postprocess-images ~/data-master/buffer-images ~/data-standby/buffer-images  > differences
+
+9. The 'differences' file should be empty. If not, investigate.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+If you get errors like this from postprocess-images:
+
+    could not reorder line XXX
+
+It can be caused by an all-zeros page being logged with XLOG_HEAP_NEWPAGE
+records. Look at the line in the buffer-image file, see if it's all-zeros.
+This can happen e.g. when you change the tablespace of a table. See
+log_newpage() in heapam.c.
+
+You can use pg_xlogdump to see which WAL record a page image corresponds
+to. But beware that the LSN in the page image points to the *end* of the
+WAL record, while the LSN that pg_xlogdump prints is the *beginning* of
+the WAL record. So to find which WAL record a page image corresponds to,
+find the LSN from the page image in pg_xlogdump output, and back off one
+record. (you can't just grep for the line containing the LSN).
diff --git a/contrib/page_image_logging/postprocess-images.c b/contrib/page_image_logging/postprocess-images.c
new file mode 100644
index 0000000..6b4ab4c
--- /dev/null
+++ b/contrib/page_image_logging/postprocess-images.c
@@ -0,0 +1,578 @@
+#include "postgres_fe.h"
+
+typedef uintptr_t Datum;
+#include "access/htup_details.h"
+#include "access/nbtree.h"
+#include "storage/bufpage.h"
+
+#define LINESZ (BLCKSZ*2 + 100)
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits. Our strategy is to normalize all pages by creating a
+ * mask of those bits that are not expected to match.
+ */
+
+/*
+ * Build a mask that covers unused space between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page, uint8 *mask)
+{
+	int			pd_lower = ((PageHeader) page)->pd_lower;
+	int			pd_upper = ((PageHeader) page)->pd_upper;
+	int			pd_special = ((PageHeader) page)->pd_special;
+
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		fprintf(stderr, "invalid page at %X/%08X\n", ((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+		return;					/* don't memset with a bogus length below */
+	}
+
+	memset(mask + pd_lower, 0xFF, pd_upper - pd_lower);
+}
+
+static void
+build_heap_mask(char *page, uint8 *mask)
+{
+	OffsetNumber off;
+	PageHeader mask_phdr = (PageHeader) mask;
+
+	mask_unused_space(page, mask);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	mask_phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	mask_phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	mask_phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId		iid = PageGetItemId(page, off);
+		char	   *mask_item;
+
+		mask_item = (char *) (mask + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader mask_htup = (HeapTupleHeader) mask_item;
+
+			mask_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			mask_htup->t_infomask |= HEAP_COMBOCID;
+			mask_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int		len = ItemIdGetLength(iid);
+			int		padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(mask_item + len, 0xFF, padlen);
+		}
+	}
+}
+
+static void
+build_spgist_mask(char *page, uint8 *mask)
+{
+	mask_unused_space(page, mask);
+}
+
+static void
+build_gist_mask(char *page, uint8 *mask)
+{
+	mask_unused_space(page, mask);
+}
+
+static void
+build_gin_mask(BlockNumber blkno, char *page, uint8 *mask)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page, mask);
+}
+
+static void
+build_sequence_mask(char *page, uint8 *mask)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(mask, 0xFF, BLCKSZ);
+}
+
+static void
+build_btree_mask(char *page, uint8 *mask)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+
+	mask_unused_space(page, mask);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* page content, between standard page header and opaque struct */
+		memset(mask + SizeOfPageHeaderData, 0xFF, BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) mask)->pd_lower, 0xFF, sizeof(uint16));
+		memset(&((PageHeader) mask)->pd_upper, 0xFF, sizeof(uint16));
+	}
+	else
+	{
+		/* Mask DEAD line pointer bits */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+			ItemIdData m;
+
+			if (ItemIdIsDead(iid))
+			{
+				memset(&m, 0, sizeof(ItemIdData));
+				m.lp_flags = 2;	/* the bit that differs between LP_NORMAL (1) and LP_DEAD (3) */
+
+				memcpy((char *) mask + (((char *) iid) - page), &m, sizeof(ItemIdData));
+			}
+		}
+	}
+
+	/* Mask BTP_HAS_GARBAGE flag */
+	{
+		BTPageOpaque maskopaq = (BTPageOpaque) (((char *) mask) + ((PageHeader) page)->pd_special);
+
+		maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+	}
+}
+
+static inline unsigned char
+parsehex(char digit, bool *success)
+{
+	if (digit >= '0' && digit <= '9')
+	{
+		*success = true;
+		return (unsigned char) (digit - '0');
+	}
+	if (digit >= 'a' && digit <= 'f')
+	{
+		*success = true;
+		return (unsigned char) (digit - 'a' + 10);
+	}
+	if (digit >= 'A' && digit <= 'F')
+	{
+		*success = true;
+		return (unsigned char) (digit - 'A' + 10);
+	}
+	*success = false;
+	return 0;
+}
+
+static inline void
+tohex(uint8 byte, char *out)
+{
+	const char *digits = "0123456789ABCDEF";
+
+	out[0] = digits[byte >> 4];
+	out[1] = digits[byte & 0x0F];
+}
+
+/*
+ * Mask any known changing parts, like hint bits, from the line. The line
+ * is modified in place. Full nibbles to be ignored are set to '#' in the
+ * hex output, and individual bits are set to 0.
+ */
+static void
+maskline(char *line)
+{
+	char		page[BLCKSZ];
+	uint8		mask[BLCKSZ];
+	int			i;
+	uint16		tail;
+	char	   *pstart;
+	char	   *p;
+	BlockNumber blkno;
+
+	/* Parse the line */
+	p = strstr(line, " blk: ");
+	if (p == NULL)
+		return;
+
+	sscanf(p, " blk: %u", &blkno);
+
+	pstart = strstr(line, "after: ");
+	if (pstart == NULL)
+		return;
+	pstart += strlen("after: ");
+
+	/* Decode the hex-encoded page back to raw bytes */
+	p = pstart;
+	for (i = 0; i < BLCKSZ; i++)
+	{
+		bool		success;
+		unsigned char c;
+
+		c = parsehex(*(p++), &success) << 4;
+		if (!success)
+			return;
+		c |= parsehex(*(p++), &success);
+		if (!success)
+			return;
+
+		page[i] = (char) c;
+	}
+
+	/*
+	 * Ok, we now have the original block contents in 'page'. Look at the
+	 * size of the special area, and the last two bytes in it, to detect
+	 * what kind of a page it is. Call the appropriate masking function.
+	 */
+
+	/* begin with an empty mask */
+	memset(mask, 0, BLCKSZ);
+
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+
+	/* Try to detect what kind of a page it is */
+	if (PageGetSpecialSize(page) == 0)
+	{
+		build_heap_mask(page, mask);
+	}
+	else if (PageGetSpecialSize(page) == 16)
+	{
+		if (tail == 0xFF81)
+			build_gist_mask(page, mask);
+		else if (tail <= 0xFF7F)
+			build_btree_mask(page, mask);
+	}
+	else if (PageGetSpecialSize(page) == 8)
+	{
+		if (tail == 0xFF82)
+			build_spgist_mask(page, mask);
+		else if (*((uint32 *) (page + BLCKSZ - MAXALIGN(sizeof(uint32)))) == 0x1717)
+			build_sequence_mask(page, mask);
+		else
+			build_gin_mask(blkno, page, mask);
+	}
+
+	/* Apply the mask, replacing masked nibbles with # */
+	for (i = 0; i < BLCKSZ; i++)
+	{
+		uint8 c;
+
+		if (mask[i] == 0)
+			continue;
+
+		c = ((uint8) page[i]) & ~mask[i];
+
+		tohex(c, &pstart[2 * i]);
+
+		if ((mask[i] & 0xF0) == 0xF0)
+			pstart[2 * i] = '#';
+		if ((mask[i] & 0x0F) == 0x0F)
+			pstart[2 * i + 1] = '#';
+	}
+}
+
+
+
+/* ----------------------------------------------------------------
+ * Line reordering
+ *
+ * When the page images are logged in master and standby, they are
+ * not necessarily written out in the same order. For example, if a
+ * single WAL-logged operation modifies multiple pages, like an index
+ * page split, the standby might release the locks in different order
+ * than the master. Another cause is concurrent operations; writing
+ * the page images is not atomic with WAL insertion, so if two
+ * backends are running concurrently, their modifications in the
+ * image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines are read into a reorder buffer, and sorted
+ * there. Sorting the whole file would be overkill, as the lines are
+ * mostly in order already. The fixed-size reorder buffer works as
+ * long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ */
+
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	    *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* number of lines currently in buffer */
+
+	FILE	   *fp;
+	int			lineno;		/* current input line number (for debugging) */
+	bool		eof;		/* have we reached EOF from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the file into the reorder buffer, until the buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+		buf->lineno++;
+
+		/* common case: the new line goes to the end */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* find the right place in the queue */
+			int			i;
+
+			for (i = buf->nlines - 2; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+			if (i < 0)
+			{
+				fprintf(stderr, "could not reorder line %d\n", buf->lineno);
+				pg_free(linebuf);
+				continue;
+			}
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->lineno = 0;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		sscanf(buf->lines[nlines], "LSN: %X/%08X", &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+
+	return nlines;
+}
+
+static void
+freerecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+static bool
+diffrecords(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	if (nlines_a != nlines_b)
+		return true;
+	for (i = 0; i < nlines_a; i++)
+	{
+		/* First try a straight byte-per-byte comparison. */
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			/* They were not byte-per-byte identical. Try again after masking */
+			maskline(lines_a[i]);
+			maskline(lines_b[i]);
+			if (strcmp(lines_a[i], lines_b[i]) != 0)
+				return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: postprocess-images <master's image file> <standby's image file>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	fp_a = fopen(argv[1], "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "could not open file \"%s\"\n", argv[1]);
+		exit(1);
+	}
+	fp_b = fopen(argv[2], "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "could not open file \"%s\"\n", argv[2]);
+		exit(1);
+	}
+
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* compare the records */
+		if (nlines_b == 0 || (nlines_a > 0 && lsn_a < lsn_b))
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (nlines_a == 0 || lsn_a > lsn_b)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecords(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7217e96..3c8ed7b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6974,8 +6974,8 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 	recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_NEWPAGE, rdata);
 
 	/*
-	 * The page may be uninitialized. If so, we can't set the LSN because
-	 * that would corrupt the page.
+	 * The page may be uninitialized. If so, we can't set the LSN and TLI
+	 * because that would corrupt the page.
 	 */
 	if (!PageIsNew(page))
 	{
@@ -6984,6 +6984,12 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+	/*
+	 * the normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	log_page_change(page, rnode, forkNum, blkno);
+
 	return recptr;
 }
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0d806af..48cb809 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -996,9 +996,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	/* rightpage was already initialized by _bt_getbuf */
 
 	/*
-	 * Copy the original page's LSN into leftpage, which will become the
-	 * updated version of the page.  We need this because XLogInsert will
-	 * examine the LSN and possibly dump it in a page image.
+	 * Copy the original page's LSN and TLI into leftpage, which will become
+	 * the updated version of the page.  We need this because XLogInsert will
+	 * examine these fields and possibly dump them in a page image.
 	 */
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e46ddb..6dc2383 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -112,6 +112,154 @@ static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 static int	rnode_comparator(const void *p1, const void *p2);
 
+/* Support for capturing changes to pages */
+typedef struct
+{
+	Buffer		buffer;
+	char		content[BLCKSZ];
+} BufferImage;
+
+#define MAX_BEFORE_IMAGES		100
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+int		   before_images_cnt = 0;
+
+static FILE *imagefp;
+
+static bool
+log_before_after_images(char *msg, BufferImage *img)
+{
+	Page		newcontent = BufferGetPage(img->buffer);
+	Page		oldcontent = (Page) img->content;
+	XLogRecPtr	oldlsn;
+	XLogRecPtr	newlsn;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+
+	oldlsn = PageGetLSN(oldcontent);
+	newlsn = PageGetLSN(newcontent);
+
+	if (oldlsn == newlsn)
+	{
+		/* no change */
+		return false;
+	}
+
+	BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+
+	log_page_change(newcontent, &rnode, forknum, blkno);
+
+	return true;
+}
+
+void
+log_page_change(char *newcontent, RelFileNode *rnode, int forknum, uint32 blkno)
+{
+	XLogRecPtr newlsn = PageGetLSN((Page) newcontent);
+	int			i;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	fprintf(imagefp, "LSN: %X/%08X, rel: %u/%u/%u, blk: %u; ",
+			(uint32) (newlsn >> 32), (uint32) newlsn,
+			rnode->spcNode, rnode->dbNode, rnode->relNode,
+			blkno);
+	if (forknum != MAIN_FORKNUM)
+		fprintf(imagefp, "forknum: %u; ", forknum);
+
+	/* write the page contents, in hex */
+	{
+		char		buf[BLCKSZ * 2];
+		int			j = 0;
+
+		fprintf(imagefp, "after: ");
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8		byte = (uint8) newcontent[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+
+	fprintf(imagefp, "\n");
+	fflush(imagefp);
+
+	LWLockRelease(SyncScanLock);
+}
+
+static void
+remember_before_image(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy (img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * Forget a buffer image. If the page was modified, log the new contents.
+ */
+static void
+forget_image(Buffer buffer)
+{
+	int			i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+
+			log_before_after_images("forget", before_images[i]);
+			if (i != before_images_cnt)
+			{
+				/* swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+			before_images_cnt--;
+
+			return;
+		}
+	}
+	elog(LOG, "could not find image for buffer %u", buffer);
+}
+
+/*
+ * See if any of the buffers we've memorized have changed.
+ */
+void
+log_page_changes(char *msg)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (log_before_after_images(msg, img))
+		{
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
 
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
@@ -1729,6 +1877,10 @@ AtEOXact_Buffers(bool isCommit)
 	}
 #endif
 
+	if (before_images_cnt > 0)
+		elog(LOG, "released all page-images (AtEOXact_Buffers)");
+	before_images_cnt = 0;
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1744,7 +1896,18 @@ AtEOXact_Buffers(bool isCommit)
 void
 InitBufferPoolBackend(void)
 {
+	int			i;
+	BufferImage *images;
+
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen("buffer-images", "ab");
 }
 
 /*
@@ -2761,6 +2924,18 @@ LockBuffer(Buffer buffer, int mode)
 	buf = &(BufferDescriptors[buffer - 1]);
 
 	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * XXX: peek into the LWLock struct to see if we're holding it in
+		 * exclusive or shared mode. This is concurrency-safe: if we're holding
+		 * it in exclusive mode, no-one else can release it. If we're holding
+		 * it in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			forget_image(buffer);
+	}
+
+	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
 		LWLockAcquire(buf->content_lock, LW_SHARED);
@@ -2768,6 +2943,9 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		remember_before_image(buffer);
 }
 
 /*
@@ -2779,6 +2957,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	res;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2786,7 +2965,11 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	res = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	if (res)
+		remember_before_image(buffer);
+
+	return res;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 36b4b8b..85f3fdc 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1238,6 +1238,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+	if (before_images_cnt > 0)
+		elog(LOG, "released all page images");
+	before_images_cnt = 0;
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0d61b82..30c055a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -87,6 +87,14 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+/* in bufmgr.c, related to capturing page images */
+extern void log_page_changes(char *msg);
+struct RelFileNode;
+extern void log_page_change(char *newcontent, struct RelFileNode *rnode, int forknum, uint32 blkno);
+
+extern int before_images_cnt;
+
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
@@ -120,6 +128,8 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		log_page_changes("END_CRIT_SECTION");		\
 } while(0)

#11

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Heikki Linnakangas (#10)

Re: WAL replay bugs

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master and
standby.

Having that in Postgres, at the disposal of developers, would be
great, and I believe that it would greatly reduce the occurrence of
bugs caused by WAL replay during recovery. So, with the permission of
the author, I have been looking at this facility for a cleaner
integration into Postgres.

Roughly, this utility is made of three parts:
1) A set of masking functions that can be used on page images to
normalize them. This is used to put magic numbers or enforce flag
values to make page content consistent across nodes. This is for
example the case of the free space between pd_lower and pd_upper,
pd_flags, etc. Of course this depends on the type of page (btree,
heap, etc.).
2) A facility to memorize page images, check whether they have been
modified, and flush them to a dedicated file. This interacts mainly
with the buffer manager.
3) A facility to reorder page images within the same WAL record, as
master and standby may not write them in the same order, for example
because locks are released in a different order. This is part of the
binary that analyzes the diffs between master and standby.
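
To make part 1) more concrete, here is a minimal sketch of what a masking
function could look like. It is not the code from the patch: the names
SketchPageHeader, mask_unused_space, mask_flags, PAGE_SIZE and MASK_MARKER
are hypothetical, and the header layout is deliberately simplified compared
to the real page header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	8192	/* assumed page size for this sketch */
#define MASK_MARKER	0x5A	/* arbitrary filler byte, an assumption */

/*
 * Hypothetical, simplified page header. Only the fields needed for the
 * illustration are kept; the real header has more.
 */
typedef struct
{
	uint64_t	lsn;
	uint16_t	flags;
	uint16_t	lower;		/* offset to start of free space */
	uint16_t	upper;		/* offset to end of free space */
} SketchPageHeader;

/* Overwrite the unused hole between lower and upper with a fixed marker. */
void
mask_unused_space(char *page)
{
	SketchPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));
	assert(hdr.lower >= sizeof(hdr));
	assert(hdr.lower <= hdr.upper && hdr.upper <= PAGE_SIZE);
	memset(page + hdr.lower, MASK_MARKER, hdr.upper - hdr.lower);
}

/* Force the flag bits to a fixed value so that they never cause a diff. */
void
mask_flags(char *page)
{
	SketchPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));
	hdr.flags = 0;
	memcpy(page, &hdr, sizeof(hdr));
}
```

After both functions have run on the master's and the standby's copy of a
page, two pages that differ only in the hole or in the flag bits compare
equal byte for byte.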

As of now, 2) is integrated in the backend, 1) and 3) are part of the
contrib module. However I am thinking that 1) and 2) should be done in
core using an ifdef similar to CLOBBER_FREED_MEMORY, to mask the page
images and write them in a dedicated file (in global/ ?), while 3)
would be fine as a separate binary in contrib/. An essential thing to
add would be to have a set of regression tests that developers and
buildfarm machines could directly use.

Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

Regards,
--
Michael


#12

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Michael Paquier (#11)

1 attachment(s)

Re: WAL replay bugs

On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

OK, attached is the result of this hacking:

Buffer capture facility: check WAL replay consistency

This is a tool for developers and buildfarm machines to check
consistency at page level when replaying WAL files among several
nodes of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits, for example), and then
each buffer is captured in the following format, as a single line
of the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between
pages, and format is chosen to facilitate comparison between buffer
entries.
- A client part, located in contrib/buffer_capture_cmp, that can be
used to compare buffer captures between nodes.
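
As an illustration of the line format described above, here is a rough
sketch of how one capture line could be written and how its LSN could be
read back. The function names format_capture_line and parse_capture_lsn
and the PAGE_SIZE constant are assumptions made for this example, not the
names used in the patch; CAPTURE_LINE_SIZE follows the same arithmetic as
LINESZ in the comparison tool (29 characters of prefix, two hex digits per
byte, a newline and a terminating NUL).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE			8192	/* assumed page size for this sketch */
#define CAPTURE_LINE_SIZE	(PAGE_SIZE * 2 + 31)

/*
 * Write "LSN: %08X/%08X page: <hex>\n" into out, which must have room
 * for CAPTURE_LINE_SIZE bytes. Returns the length of the line, not
 * counting the terminating NUL.
 */
int
format_capture_line(char *out, uint64_t lsn, const unsigned char *page)
{
	static const char digits[] = "0123456789ABCDEF";
	int			pos;
	int			i;

	assert(out != NULL && page != NULL);
	pos = snprintf(out, CAPTURE_LINE_SIZE, "LSN: %08X/%08X page: ",
				   (unsigned int) (lsn >> 32),
				   (unsigned int) (lsn & 0xFFFFFFFF));
	for (i = 0; i < PAGE_SIZE; i++)
	{
		out[pos++] = digits[page[i] >> 4];
		out[pos++] = digits[page[i] & 0x0F];
	}
	out[pos++] = '\n';
	out[pos] = '\0';
	return pos;
}

/* Parse the LSN prefix of a capture line; returns 1 on success, else 0. */
int
parse_capture_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
		return 0;
	*lsn = ((uint64_t) hi << 32) | lo;
	return 1;
}
```

Because both halves of the LSN are zero-padded to a fixed width and always
printed in upper case, comparing two lines with strcmp() orders them by
LSN, which is what the reorder buffer in the comparison tool relies on.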

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.
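
The build-time switch can be pictured with the following sketch. Only the
symbol name BUFFER_CAPTURE comes from the patch; capture_enabled and
capture_note are hypothetical helpers invented for the illustration, and
the real hooks in the patch look different.

```c
#include <assert.h>
#include <stdio.h>

#ifdef BUFFER_CAPTURE
/* Capture build (compiled with -DBUFFER_CAPTURE): the hook does real work. */
int
capture_enabled(void)
{
	return 1;
}

void
capture_note(FILE *fp, const char *what)
{
	assert(fp != NULL);
	fprintf(fp, "captured: %s\n", what);
}
#else
/* Normal build: the hook compiles away to nothing. */
int
capture_enabled(void)
{
	return 0;
}

#define capture_note(fp, what) ((void) 0)
#endif
```

With this shape, call sites can invoke capture_note() unconditionally, and
a build without the symbol pays no run-time cost for them.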

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder. There is a default set of tests saved in
test-default.sh, but the user is free to set up custom tests by
creating a file called test-custom.sh; if this file is present, the
test facility runs it instead of the defaults.

The patch will be added to the first commit fest as well. Note that the
footprint on core code is limited, so even if there are more than 1k
lines of code, review is simpler than it looks.

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached, but perhaps this should be changed as a separate patch.
2) The regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This can facilitate the integration of other
modules using regression tests based on bash scripts.
3) While hacking on this facility, I noticed that some ItemId entries in
btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but it looks like a bug in
Postgres itself.

Documentation is added in the code itself; I didn't feel any need to
expose this facility to ordinary users in doc/src/sgml...
Regards,
--
Michael

Attachments:

0001-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain; charset=US-ASCII)

From ae5b957ae33648afda6e936801d8cc23d7469954 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 13 Jun 2014 15:54:41 +0900
Subject: [PATCH] Buffer capture facility: check WAL replay consistency

This is a tool for developers and buildfarm machines to check
consistency at page level when replaying WAL files among several
nodes of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits, for example) and then each
buffer is captured in the following format, as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be set in CFLAGS at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder. There is a default set of tests saved in test-default.sh,
but the user is free to set up custom tests by creating a file called
test-custom.sh; if this file is present, the test facility runs it
instead of the defaults.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  32 ++
 contrib/buffer_capture_cmp/README               |  34 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 346 +++++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  14 +
 contrib/buffer_capture_cmp/test.sh              | 203 ++++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/commands/sequence.c                 |   5 -
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 482 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   3 +
 src/include/commands/sequence.h                 |   5 +
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  65 ++++
 17 files changed, 1266 insertions(+), 7 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..da4316b
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,32 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS=--use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..1039c69
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,34 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that can be
+run by default. This simple command is enough to run the tests:
+
+    make check
+
+The code contains a hook that can be used as an entry point to run some
+custom tests using this facility. Simply create in this folder a file
+called test-custom.sh and execute all the commands necessary for the
+tests. This script can use the node number of the master node available
+as the first argument of the script when it is run within the test
+suite.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..2695597
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,346 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ *
+ * If symbol BUFFER_CAPTURE is not defined, this utility does nothing.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+#include "storage/bufcapt.h"
+
+#ifdef BUFFER_CAPTURE
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know the result of the comparison
+			 * with the last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Parse the LSN from the fixed-width prefix of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of lines */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's data folder> <standby's data folder>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	canonicalize_path(path_a);
+	path_a = psprintf("%s/%s", path_a, BUFFER_CAPTURE_FILE);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	canonicalize_path(path_b);
+	path_b = psprintf("%s/%s", path_b, BUFFER_CAPTURE_FILE);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (lsn_a < lsn_b || nlines_b == 0)
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
+
+#else
+int
+main(int argc, char **argv)
+{
+	return 0;
+}
+#endif /* BUFFER_CAPTURE */
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..5bec503
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck > /dev/null 2>&1
+# cd $ROOT_DIR
+
+# Create a simple table
+psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..8a28f0a
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,203 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_capture on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Leave immediately in case of an error
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses desirably
+testhost=`uname -s`
+case $testhost in
+	MINGW*) LISTEN_ADDRESSES="localhost" ;;
+	*)      LISTEN_ADDRESSES="" ;;
+esac
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return non-zero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+newsrc=`cd ../.. && pwd`
+# Calculate port to use as a base for calculations
+PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' $newsrc/src/include/pg_config.h | awk '{print $3}'`
+PG_BASE_PORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+# Get free port for master node
+PG_MASTER_PORT=$PG_BASE_PORT
+while psql -X postgres -p $PG_MASTER_PORT </dev/null 2>/dev/null
+do
+	i=`expr ${i:-0} + 1`
+	if [ $i -eq 16 ]
+	then
+		echo port $PG_MASTER_PORT apparently in use
+		exit 1
+	fi
+	PG_MASTER_PORT=`expr $PG_MASTER_PORT + 1`
+done
+export PGPORT=$PG_MASTER_PORT
+
+# Get free port for standby node
+PG_STANDBY_PORT=`expr $PG_MASTER_PORT + 1`
+while psql -X postgres -p $PG_STANDBY_PORT </dev/null 2>/dev/null
+do
+	i=`expr ${i:-0} + 1`
+	if [ $i -eq 16 ]
+	then
+		echo port $PG_STANDBY_PORT apparently in use
+		exit 1
+	fi
+	PG_STANDBY_PORT=`expr $PG_STANDBY_PORT + 1`
+done
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $TEST_MASTER/buffer_captures
+rm -rf $TEST_STANDBY/buffer_captures
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check the presence of custom tests and kick them in priority. If not,
+# fallback to the default tests. Tests need only to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	. ./test-custom.sh
+else
+	. ./test-default.sh
+fi
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+DIFF_FILE=capture_differences.txt
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $TEST_MASTER $TEST_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Cover the case where capture file does not exist
+if [ $ERR_NUM == 2 ]; then
+	echo "Capture file does not exist"
+	echo "PASSED"
+	exit 0
+elif [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b77c32c..8f5a450 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7002,6 +7005,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..2134eae 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -45,11 +45,6 @@
  */
 #define SEQ_LOG_VALS	32
 
-/*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
 typedef struct sequence_magic
 {
 	uint32		magic;
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..9e5dfc1
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,482 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER, BLCKSZ -
+			   SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * These functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+buffer_capture_is_changed(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Apply the appropriate mask to a copy of the new page content, then
+ * append the masked copy to the capture file.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)))
+	{
+		/*
+		 * It happens that btree and gist have the same size of special
+		 * area.
+		 */
+		if (tail == GIST_PAGE_ID)
+			mask_gist_page(page);
+		else if (tail <= MAX_BT_CYCLE_ID)
+			mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == 8)
+	{
+		/*
+		 * XXX: Page detection for sequences can be improved.
+		 */
+		if (tail == SPGIST_PAGE_ID)
+			mask_spgist_page(page);
+		else if (*((uint32 *) (page + BLCKSZ - MAXALIGN(sizeof(uint32)))) == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the masked page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Make sure the entry reaches the capture file */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int	i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (buffer_capture_is_changed(img))
+			{
+				Page content = BufferGetPage(img->buffer);
+				RelFileNode	rnode;
+				ForkNumber	forknum;
+				BlockNumber	blkno;
+
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+	elog(LOG, "could not find image for buffer %u", buffer);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (buffer_capture_is_changed(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..b1b2467 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1728,6 +1731,10 @@ AtEOXact_Buffers(bool isCommit)
 	}
 #endif
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1744,6 +1751,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2759,6 +2770,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * XXX: peek into the LWLock struct to see if we're holding it in
+		 * exclusive or shared mode. This is concurrency-safe: if we're holding
+		 * it in exclusive mode, no-one else can release it. If we're holding
+		 * it in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2767,6 +2792,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2778,6 +2808,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	res;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2785,7 +2816,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	res = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (res)
+		buffer_capture_remember(buffer);
+#endif
+
+	return res;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d23ac62..32762a6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1240,6 +1244,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..6beaa15 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,6 +21,9 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		ignore_checksum_failure = false;
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 7d8a370..73b0ede 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,11 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * "special area" identifier of a sequence's buffer page
+ */
+#define SEQ_MAGIC     0x1717
+
 
 typedef struct FormData_pg_sequence
 {
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..089e5a7
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also flushed to BUFFER_CAPTURE_FILE. Each line of the
+ * output file holds a single page image.
+ *
+ * The page images are meant to be used with the utility
+ * buffer_capture_cmp, available in contrib/, which compares how WAL is
+ * replayed between master and standby nodes and so helps in spotting
+ * bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * taken on different nodes. Be aware that this has a large impact on I/O
+ * and that it is intended only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN is the log sequence number to which the page image applies,
+ * followed by the content of the image as-is. The fixed-width format
+ * makes capture entries easy to compare, particularly when the LSN
+ * grows by a digit.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.0

#13

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Michael Paquier (#12)

Re: WAL replay bugs

On 06/13/2014 10:14 AM, Michael Paquier wrote:

On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

OK, attached is the result of this hacking:

Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines
that can be used to check for consistency at page level when replaying
WAL files among several nodes of a cluster (generally master and
standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits for example) and then
each buffer is captured with the following format as a single line
of the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between
pages, and format is chosen to facilitate comparison between buffer
entries.
- A client part, located in contrib/buffer_capture_cmp, that can be
used to compare buffer captures between nodes.
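
The entry format described above can be illustrated with a small standalone sketch. The helper names format_capture_entry and parse_capture_lsn are hypothetical and do not appear in the patch; the sketch only assumes the "LSN: %08X/%08X page: " prefix followed by the page bytes in uppercase hex, as stated in the description.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: format one capture entry as
 * "LSN: %08X/%08X page: <hex>" into out. Returns the number of characters
 * written (excluding the terminating NUL), or -1 if out is too small.
 */
static int
format_capture_entry(uint64_t lsn, const unsigned char *page, size_t pagelen,
                     char *out, size_t outlen)
{
    static const char digits[] = "0123456789ABCDEF";
    int         n;
    size_t      i;

    assert(page != NULL && out != NULL);

    /* Fixed-width LSN, high 32 bits first, then low 32 bits */
    n = snprintf(out, outlen, "LSN: %08X/%08X page: ",
                 (unsigned int) (lsn >> 32), (unsigned int) lsn);
    if (n < 0 || (size_t) n + pagelen * 2 + 1 > outlen)
        return -1;

    /* Two hex digits per page byte */
    for (i = 0; i < pagelen; i++)
    {
        out[n++] = digits[page[i] >> 4];
        out[n++] = digits[page[i] & 0x0F];
    }
    out[n] = '\0';
    return n;
}

/*
 * Hypothetical helper: read the LSN back from an entry. Returns 1 on
 * success and 0 if the line does not start with a well-formed LSN.
 */
static int
parse_capture_lsn(const char *line, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
        return 0;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}
```

Because both halves of the LSN are zero-padded to eight digits, entries from two nodes line up column by column, so a plain textual comparison of two lines already points at the differing page bytes.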

Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Heikki Linnakangas (#13)

Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

My thought is to share the CPU effort of masking between backends...
It's not a big deal to move the masking back to the client tool though.
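
To make the trade-off concrete, here is a minimal sketch of what masking on the client side could look like: the raw image stays untouched for debugging and only a copy is masked before comparison. Everything prefixed with sketch_ or SKETCH_ is hypothetical. The header size and field offsets are assumptions about the standard page header layout, the fields are read in host byte order, and a real client tool would take all of this from storage/bufpage.h rather than hard-coding it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All values below are assumptions made for this sketch. */
#define SKETCH_BLCKSZ       8192    /* default block size */
#define SKETCH_HDR_SIZE     24      /* size of the standard page header */
#define SKETCH_OFF_LOWER    12      /* byte offset of pd_lower */
#define SKETCH_OFF_UPPER    14      /* byte offset of pd_upper */
#define SKETCH_OFF_SPECIAL  16      /* byte offset of pd_special */
#define SKETCH_MASK_MARKER  0xFF

/* Read a 16-bit header field in host byte order. */
static uint16_t
sketch_read_u16(const unsigned char *p)
{
    uint16_t    v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Copy raw into masked and overwrite the hole between pd_lower and
 * pd_upper in the copy only. Returns 0 on success, -1 if the header
 * fields are inconsistent.
 */
static int
sketch_mask_copy(const unsigned char *raw, unsigned char *masked)
{
    uint16_t    lower;
    uint16_t    upper;
    uint16_t    special;

    assert(raw != NULL && masked != NULL);

    lower = sketch_read_u16(raw + SKETCH_OFF_LOWER);
    upper = sketch_read_u16(raw + SKETCH_OFF_UPPER);
    special = sketch_read_u16(raw + SKETCH_OFF_SPECIAL);

    if (lower < SKETCH_HDR_SIZE || lower > upper ||
        upper > special || special > SKETCH_BLCKSZ)
        return -1;

    memcpy(masked, raw, SKETCH_BLCKSZ);
    memset(masked + lower, SKETCH_MASK_MARKER, (size_t) (upper - lower));
    return 0;
}

/*
 * Returns 1 if the two raw pages are equal once masked, 0 if they
 * differ, and -1 if either page has an invalid header.
 */
static int
sketch_pages_match(const unsigned char *a, const unsigned char *b)
{
    static unsigned char ma[SKETCH_BLCKSZ];
    static unsigned char mb[SKETCH_BLCKSZ];

    if (sketch_mask_copy(a, ma) != 0 || sketch_mask_copy(b, mb) != 0)
        return -1;
    return memcmp(ma, mb, SKETCH_BLCKSZ) == 0;
}
```

With this arrangement the capture file can hold genuine page images, and a mismatch reported by the comparison can still be inspected byte for byte in the unmasked data.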
--
Michael


#15

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Michael Paquier (#14)

Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

My thought is to share the CPU effort of masking between backends...

Also, having a set of APIs for page masking on the server side
would be useful for extensions as well. As I recall, this was one
of the first things that came to my mind when looking at this
facility: it would be useful to have the masking functions in a
separate file, with a dedicated header.
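
As an illustration only, a masking API of that kind could be as small as a registry of callbacks keyed by the identifier stored in the last two bytes of a page, which is the same field the patch inspects to tell page types apart. None of the names below exist in PostgreSQL or in the patch; they are hypothetical, and the identifier is compared in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_MASKERS  8

/* A masking callback rewrites, in place, the bytes allowed to differ. */
typedef void (*sketch_mask_fn) (unsigned char *page, size_t pagesz);

typedef struct
{
    uint16_t        page_id;    /* value of the page's last two bytes */
    sketch_mask_fn  mask;
} sketch_masker;

static sketch_masker sketch_maskers[SKETCH_MAX_MASKERS];
static int  sketch_nmaskers = 0;

/* Register a callback for one page identifier. Returns 0, or -1 on error. */
static int
sketch_register_masker(uint16_t page_id, sketch_mask_fn fn)
{
    if (fn == NULL || sketch_nmaskers >= SKETCH_MAX_MASKERS)
        return -1;
    sketch_maskers[sketch_nmaskers].page_id = page_id;
    sketch_maskers[sketch_nmaskers].mask = fn;
    sketch_nmaskers++;
    return 0;
}

/*
 * Mask a page with the callback registered for its identifier.
 * Returns 1 if a callback ran, 0 if the page type is unknown.
 */
static int
sketch_mask_page(unsigned char *page, size_t pagesz)
{
    uint16_t    tail;
    int         i;

    assert(page != NULL && pagesz >= sizeof(tail));
    memcpy(&tail, page + pagesz - sizeof(tail), sizeof(tail));

    for (i = 0; i < sketch_nmaskers; i++)
    {
        if (sketch_maskers[i].page_id == tail)
        {
            sketch_maskers[i].mask(page, pagesz);
            return 1;
        }
    }
    return 0;
}

/* Example callback: mask everything except the trailing identifier. */
static void
sketch_mask_all_but_id(unsigned char *page, size_t pagesz)
{
    memset(page, 0xFF, pagesz - sizeof(uint16_t));
}
```

An extension with its own page format would register one callback at load time, and the capture or comparison code would only ever call sketch_mask_page. Page types without a trailing identifier, such as heap pages, would need a different lookup key.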
--
Michael


#16

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Michael Paquier (#12)

3 attachment(s)

Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached but perhaps this should be changed as a separate patch
2) Regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This can facilitate the integration of other
modules using regression based on bash scripts.
3) While hacking this facility, I noticed that some ItemId entries in
btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but it looks like a bug of
Postgres itself.

Attached are 3 patches doing exactly this separation for readability.
Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-to-sequence_h.patch (text/plain; charset=US-ASCII)

From 310f7f9c9563e68c084131769e025b93db7fd91e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 10:38:57 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC to sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 5 -----
 src/include/commands/sequence.h | 4 ++++
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..2134eae 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -45,11 +45,6 @@
  */
 #define SEQ_LOG_VALS	32
 
-/*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
 typedef struct sequence_magic
 {
 	uint32		magic;
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..3a69580 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,10 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * "special area" identifier of a sequence's buffer page
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.0

0002-Extract-generic-bash-initialization-process-from-pg_upgrade.patch (text/plain; charset=US-ASCII)

From 019455433b05f1fcd28eee7ff4dc14d13680d983 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 11:54:59 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 ++++--------------------------------
 src/test/shell/init_env.sh | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 9d31f9a..7b05500 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' $newsrc/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port $newsrc
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..b10e19f
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Establish how the server will listen for connections
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' $PG_ROOT_DIR/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo port $PGPORT apparently in use
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.0

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain; charset=US-ASCII)

From 82a1c6ef8a3f531b5da398f49566c0a54d8f09a3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 12:15:06 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

This is a tool intended for developers and buildfarm machines. It checks
consistency at page level when replaying WAL files among several nodes
of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits for example) and then each
buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be defined in CFLAGS at build time
to enable the buffer capture at server level. If this symbol is not
defined, both server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder. A default set of tests is saved in test-default.sh,
but the user is free to set up custom tests by creating a file called
test-custom.sh, which the test facility runs instead of the defaults
when it is present.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  32 ++
 contrib/buffer_capture_cmp/README               |  34 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 346 +++++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  14 +
 contrib/buffer_capture_cmp/test.sh              | 157 ++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 482 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   3 +
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  65 ++++
 15 files changed, 1215 insertions(+), 2 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..da4316b
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,32 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS=--use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..1039c69
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,34 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that
+can be run by default. This simple command is enough to run the tests:
+
+    make check
+
+The code contains a hook that can be used as an entry point to run some
+custom tests using this facility. Simply create in this folder a file
+called test-custom.sh and execute all the commands necessary for the
+tests. This script can use the node number of the master node available
+as the first argument of the script when it is run within the test
+suite.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..2695597
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,346 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ *
+ * If symbol BUFFER_CAPTURE is not defined, this utility does nothing.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+#include "storage/bufcapt.h"
+
+#ifdef BUFFER_CAPTURE
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know that it sorts before the
+			 * last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Parse the LSN from the prefix of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of records */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a && j < strlen_b; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's data folder> <standby's data folder>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	canonicalize_path(path_a);
+	path_a = psprintf("%s/%s", path_a, BUFFER_CAPTURE_FILE);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	canonicalize_path(path_b);
+	path_b = psprintf("%s/%s", path_b, BUFFER_CAPTURE_FILE);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (lsn_a < lsn_b || nlines_b == 0)
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
+
+#else
+int
+main(int argc, char **argv)
+{
+	return 0;
+}
+#endif /* BUFFER_CAPTURE */
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..5bec503
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck 2>&1 > /dev/null
+# cd $ROOT_DIR
+
+# Create a simple table
+psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..f8e7858
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,157 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_captures on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $TEST_MASTER/buffer_captures
+rm -rf $TEST_STANDBY/buffer_captures
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check the presence of custom tests and kick them in priority. If not,
+# fallback to the default tests. Tests need only to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	. ./test-custom.sh
+else
+	. ./test-default.sh
+fi
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+DIFF_FILE=capture_differences.txt
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $TEST_MASTER $TEST_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Cover the case where capture file does not exist
+if [ $ERR_NUM == 2 ]; then
+	echo "Capture file does not exist"
+	echo "PASSED"
+	exit 0
+elif [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b77c32c..8f5a450 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7002,6 +7005,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..9e5dfc1
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,482 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X\n",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER, BLCKSZ -
+			   SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * These functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+buffer_capture_is_changed(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Flush to file the new content page present here after applying a
+ * mask on it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)))
+	{
+		/*
+		 * It happens that btree and gist have the same size of special
+		 * area.
+		 */
+		if (tail == GIST_PAGE_ID)
+			mask_gist_page(page);
+		else if (tail <= MAX_BT_CYCLE_ID)
+			mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == 8)
+	{
+		/*
+		 * XXX: Page detection for sequences can be improved.
+		 */
+		if (tail == SPGIST_PAGE_ID)
+			mask_spgist_page(page);
+		else if (*((uint32 *) (page + BLCKSZ - MAXALIGN(sizeof(uint32)))) == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Make sure the entry reaches the file before releasing the lock */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int	i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (buffer_capture_is_changed(img))
+			{
+				Page content = BufferGetPage(img->buffer);
+				RelFileNode	rnode;
+				ForkNumber	forknum;
+				BlockNumber	blkno;
+
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+	elog(LOG, "could not find image for buffer %u", buffer);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (buffer_capture_is_changed(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..b1b2467 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1728,6 +1731,10 @@ AtEOXact_Buffers(bool isCommit)
 	}
 #endif
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1744,6 +1751,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2759,6 +2770,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * XXX: peek into the LWLock struct to see if we're holding it in
+		 * exclusive or shared mode. This is concurrency-safe: if we're holding
+		 * it in exclusive mode, no-one else can release it. If we're holding
+		 * it in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2767,6 +2792,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2778,6 +2808,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	res;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2785,7 +2816,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	res = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (res)
+		buffer_capture_remember(buffer);
+#endif
+
+	return res;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d23ac62..32762a6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1240,6 +1244,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..6beaa15 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,6 +21,9 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		ignore_checksum_failure = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..089e5a7
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also flushed to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are aimed to be used with the utility tool
+ * called buffer_capture_cmp available in contrib/ and can be used to
+ * compare how WAL is replayed between master and standby nodes, helping
+ * in spotting bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * done among nodes. Be aware that this has a large impact on I/O and that
+ * it is aimed only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN is the log sequence number to which the page image applies,
+ * followed by the content of the image as-is. The fixed-width format
+ * is chosen to facilitate comparisons between capture entries,
+ * particularly in cases where the LSN gains a digit.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.0
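
The masking strategy used by the patch above (overwrite the bytes that are not expected to match, then compare) can be illustrated outside the backend. The following is only a minimal sketch under stated assumptions: the toy page header, the 64-byte page size, the 0xFF marker value and all `toy_*` names are hypothetical stand-ins, not PostgreSQL structures or the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	64		/* toy page size; the backend uses BLCKSZ */
#define MASK_BYTE	0xFF	/* assumed marker value, in the role of MASK_MARKER */

/* Toy page header: only the two offsets that bound the free space. */
typedef struct ToyPageHeader
{
	uint16_t	lower;		/* end of the line pointer array */
	uint16_t	upper;		/* start of the tuple data */
} ToyPageHeader;

/* Overwrite the hole between lower and upper with the marker byte. */
static void
toy_mask_unused_space(unsigned char *page)
{
	ToyPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));
	assert(hdr.lower >= sizeof(hdr));
	assert(hdr.lower <= hdr.upper && hdr.upper <= PAGE_SIZE);
	memset(page + hdr.lower, MASK_BYTE, hdr.upper - hdr.lower);
}

/* Return 1 if the two pages are identical once copies of them are masked. */
static int
toy_pages_match(const unsigned char *a, const unsigned char *b)
{
	unsigned char ca[PAGE_SIZE];
	unsigned char cb[PAGE_SIZE];

	memcpy(ca, a, PAGE_SIZE);
	memcpy(cb, b, PAGE_SIZE);
	toy_mask_unused_space(ca);
	toy_mask_unused_space(cb);
	return memcmp(ca, cb, PAGE_SIZE) == 0;
}

/* Build a toy page whose free space holds "filler" and whose data is "data". */
static void
toy_page_init(unsigned char *page, unsigned char filler, unsigned char data)
{
	ToyPageHeader hdr = {8, 48};

	memset(page, 0, PAGE_SIZE);
	memcpy(page, &hdr, sizeof(hdr));
	memset(page + hdr.lower, filler, hdr.upper - hdr.lower);
	page[hdr.upper] = data;
}
```

Two pages built with `toy_page_init()` using different filler bytes but the same data byte differ under a raw `memcmp()`, yet `toy_pages_match()` reports them as equal. Changing the data byte makes the comparison fail, which is the kind of difference the capture files are meant to expose.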

#17

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Michael Paquier (#11)

Re: WAL replay bugs

On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master and
standby.

Having that in Postgres at the disposal of developers would be
great, and I believe that it would greatly reduce the occurrence of
bugs caused by WAL replay during recovery. So, with the permission of
the author, I have been looking at this facility for a cleaner
integration into Postgres.

I'm not sure if this is reasonably possible, but one thing that would
make this tool a whole lot easier to use would be if you could make
all the magic happen in a single server. For example, suppose you had
a background process that somehow got access to the pre and post
images for every buffer change, and the associated WAL record, and
tried applying the WAL record to the pre-image to see whether it got
the corresponding post-image. Then you could run 'make check' or so
and afterwards do something like psql -c 'SELECT * FROM
wal_replay_problems()' and hopefully get no rows back.

Don't get me wrong, having this tool at all sounds great. But I think
to really get the full benefit out of it we need to be able to run it
in the buildfarm, so that if people break stuff it gets noticed
quickly.
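
The single-server idea described above (apply the WAL record to the pre-image and check that the result equals the post-image) might look roughly like this. It is a sketch only: the record format, the redo routine and every `toy_*` name are invented for illustration and do not correspond to any existing backend API.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE	32		/* toy page size, not BLCKSZ */

/* Hypothetical WAL record: "store value at offset". */
typedef struct ToyRecord
{
	unsigned	offset;
	unsigned char value;
} ToyRecord;

/* The redo routine under test: replays one record onto a page. */
static void
toy_redo(unsigned char *page, const ToyRecord *rec)
{
	assert(rec->offset < PAGE_SIZE);
	page[rec->offset] = rec->value;
}

/*
 * Apply the record to a scratch copy of the pre-image and compare the
 * result with the post-image captured when the record was generated.
 * Returns 1 if replay reproduces the post-image, 0 otherwise.
 */
static int
toy_replay_is_consistent(const unsigned char *pre,
						 const ToyRecord *rec,
						 const unsigned char *post)
{
	unsigned char scratch[PAGE_SIZE];

	memcpy(scratch, pre, PAGE_SIZE);
	toy_redo(scratch, rec);
	return memcmp(scratch, post, PAGE_SIZE) == 0;
}
```

A post-image that contains a change the record does not describe makes `toy_replay_is_consistent()` return 0. That is the case a function such as the proposed wal_replay_problems() would report.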

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Robert Haas (#17)

Re: WAL replay bugs

On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I'm not sure if this is reasonably possible, but one thing that would
make this tool a whole lot easier to use would be if you could make
all the magic happen in a single server. For example, suppose you had
a background process that somehow got access to the pre and post
images for every buffer change, and the associated WAL record, and
tried applying the WAL record to the pre-image to see whether it got
the corresponding post-image. Then you could run 'make check' or so
and afterwards do something like psql -c 'SELECT * FROM
wal_replay_problems()' and hopefully get no rows back.

So your point is to have a 3rd independent server in the process that
would compare images taken from a master and its standby? Seems to
complicate the machinery.

Don't get me wrong, having this tool at all sounds great. But I think
to really get the full benefit out of it we need to be able to run it
in the buildfarm, so that if people break stuff it gets noticed
quickly.

The patch I sent includes a regression test suite that makes testing
fairly easy: it is only a matter of running "make check" in the
contrib directory containing the binary able to compare buffer
captures between a master and a standby.

Thanks,
--
Michael


#19

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Michael Paquier (#18)

Re: WAL replay bugs

On Tue, Jun 17, 2014 at 5:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I'm not sure if this is reasonably possible, but one thing that would
make this tool a whole lot easier to use would be if you could make
all the magic happen in a single server. For example, suppose you had
a background process that somehow got access to the pre and post
images for every buffer change, and the associated WAL record, and
tried applying the WAL record to the pre-image to see whether it got
the corresponding post-image. Then you could run 'make check' or so
and afterwards do something like psql -c 'SELECT * FROM
wal_replay_problems()' and hopefully get no rows back.

So your point is to have a 3rd independent server in the process that
would compare images taken from a master and its standby? Seems to
complicate the machinery.

No, I was trying to get it down from 2 servers to 1, not 2 servers up to 3.

Don't get me wrong, having this tool at all sounds great. But I think
to really get the full benefit out of it we need to be able to run it
in the buildfarm, so that if people break stuff it gets noticed
quickly.

The patch I sent includes a regression test suite that makes testing
fairly easy: it is only a matter of running "make check" in the
contrib directory containing the binary able to compare buffer
captures between a master and a standby.

Cool!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Michael Paquier (#16)

3 attachment(s)

Re: WAL replay bugs

On Mon, Jun 16, 2014 at 12:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Jun 13, 2014 at 4:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached but perhaps this should be changed as a separate patch
2) Regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This can facilitate the integration of other
modules using regression based on bash scripts.
3) While hacking this facility, I noticed that some ItemId entries in
btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but it looks like a bug of
Postgres itself.

Attached are 3 patches doing exactly this separation for readability.

Here are rebased patches; there was a conflict with a recent commit in
contrib/pg_upgrade.
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-to-sequence.h.patch (text/x-diff; charset=US-ASCII)

From 955aa03df3b7586d8367627ca841938f37010d79 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 10:38:57 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC to sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 5 -----
 src/include/commands/sequence.h | 4 ++++
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..2134eae 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -45,11 +45,6 @@
  */
 #define SEQ_LOG_VALS	32
 
-/*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
 typedef struct sequence_magic
 {
 	uint32		magic;
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..3a69580 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,10 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * "special area" identifier of a sequence's buffer page
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.0
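
With SEQ_MAGIC exposed, a sequence page can be recognized by the magic number stored at the start of its special area. The following is a standalone sketch of that check, not the backend code: it assumes an 8192-byte page and an 8-byte special area (that is, MAXALIGN(sizeof(uint32)) == 8), and the `TOY_*` and `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ			8192	/* assumed page size */
#define TOY_SEQ_MAGIC		0x1717	/* value of SEQ_MAGIC in the patch */
#define TOY_SPECIAL_SIZE	8		/* assumed MAXALIGN(sizeof(uint32)) */

/* Store a magic number at the start of the page's special area. */
static void
toy_page_set_magic(unsigned char *page, uint32_t magic)
{
	memcpy(page + TOY_BLCKSZ - TOY_SPECIAL_SIZE, &magic, sizeof(magic));
}

/* Return 1 if the special area starts with the sequence magic number. */
static int
toy_page_is_sequence(const unsigned char *page)
{
	uint32_t	magic;

	assert(page != NULL);
	/* memcpy avoids an unaligned or type-punned read */
	memcpy(&magic, page + TOY_BLCKSZ - TOY_SPECIAL_SIZE, sizeof(magic));
	return magic == TOY_SEQ_MAGIC;
}
```

The check relies on the special area size alone, so a page of another type that happens to carry the same four bytes at that offset would be misdetected. This is presumably why the capture code marks sequence detection as something that can be improved.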

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/x-diff; charset=US-ASCII)

From 85a9a7961be6fa19d04d0c654c3b3322452e1bb0 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 11:54:59 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 19 ++++++---------
 src/test/shell/init_env.sh | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 12 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..9184035 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,6 +92,7 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
+<<<<<<< HEAD
 # Clear out any environment vars that might cause libpq to connect to
 # the wrong postmaster (cf pg_regress.c)
 #
@@ -133,6 +124,10 @@ do
 	PGPORT=`expr $PGPORT + 1`
 	export PGPORT
 done
+=======
+# Get a port to run the tests
+pg_get_test_port $newsrc
+>>>>>>> Extract generic bash initialization process from pg_upgrade
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..d37eb69
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses as appropriate for the platform
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo port $PGPORT apparently in use
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.0
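
The port selection in pg_get_test_port maps the version number into the range 49152..65535 and then probes up to 16 consecutive ports. The arithmetic can be sketched in C as follows. The version number 90500 used in the usage note is a hypothetical example, and the callbacks stand in for the psql connection attempt the script uses; all `toy_*` names are invented for illustration.

```c
#include <assert.h>

#define PORT_BASE	49152	/* first port of the range used by the script */
#define PORT_SPAN	16384	/* 49152 + 16384 = 65536 */
#define MAX_TRIES	16		/* the script gives up after 16 busy ports */

/* Same arithmetic as: expr $PG_VERSION_NUM % 16384 + 49152 */
static int
toy_base_port(int version_num)
{
	assert(version_num >= 0);
	return version_num % PORT_SPAN + PORT_BASE;
}

/*
 * Probe consecutive ports starting from the base port. Returns the first
 * port for which in_use() reports 0, or -1 if all MAX_TRIES are busy.
 */
static int
toy_pick_port(int version_num, int (*in_use) (int port))
{
	int			port = toy_base_port(version_num);
	int			i;

	for (i = 0; i < MAX_TRIES; i++)
	{
		if (!in_use(port + i))
			return port + i;
	}
	return -1;
}

/* Example probes: stand-ins for "psql -X postgres" succeeding or failing. */
static int
toy_two_ports_busy(int port)
{
	return port == 57732 || port == 57733;
}

static int
toy_always_busy(int port)
{
	(void) port;
	return 1;
}
```

For a version number of 90500, 90500 % 16384 is 8580, so the base port is 57732. With the first two ports reported busy, `toy_pick_port(90500, toy_two_ports_busy)` returns 57734. Note that neither this sketch nor the script guards against the probe running past port 65535 when the base port is close to the top of the range.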

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/x-diff; charset=US-ASCII)

From 247c669e5ff3ad3fae429b8857f24404ed9d4ffd Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 12:15:06 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines that
can be used to check for consistency at page level when replaying WAL
files among several nodes of a cluster (generally master and standby
node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits for example) and then each
buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be set in CFLAGS at build time to
enable the buffer capture at server level. If this symbol is not defined,
both server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder. There is a default set of tests saved in test-default.sh,
but the user is free to set up custom tests by creating a file called
test-custom.sh, which the test facility runs instead of the defaults if
it is present.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  32 ++
 contrib/buffer_capture_cmp/README               |  34 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 346 +++++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  14 +
 contrib/buffer_capture_cmp/test.sh              | 157 ++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 482 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   3 +
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  65 ++++
 15 files changed, 1215 insertions(+), 2 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..da4316b
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,32 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS=--use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..1039c69
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,34 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that
+can be run by default. This simple command is enough to run the tests:
+
+    make check
+
+The code contains a hook that can be used as an entry point to run some
+custom tests using this facility. Simply create in this folder a file
+called test-custom.sh and execute all the commands necessary for the
+tests. This script can use the node number of the master node available
+as the first argument of the script when it is run within the test
+suite.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..2695597
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,346 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ *
+ * If symbol BUFFER_CAPTURE is not defined, this utility does nothing.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+#include "storage/bufcapt.h"
+
+#ifdef BUFFER_CAPTURE
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know that it compares lower than
+			 * the last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Parse the LSN from the "LSN: " prefix at the start of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Report a difference if the records do not have the same number of lines */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's data folder> <standby's data folder>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	canonicalize_path(path_a);
+	path_a = psprintf("%s/%s", path_a, BUFFER_CAPTURE_FILE);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	canonicalize_path(path_b);
+	path_b = psprintf("%s/%s", path_b, BUFFER_CAPTURE_FILE);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (lsn_a < lsn_b || nlines_b == 0)
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
+
+#else
+int
+main(int argc, char **argv)
+{
+	return 0;
+}
+#endif /* BUFFER_CAPTURE */
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..5bec503
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck 2>&1 > /dev/null
+# cd $ROOT_DIR
+
+# Create a simple table
+psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..f8e7858
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,157 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_capture on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $TEST_MASTER/buffer_captures
+rm -rf $TEST_STANDBY/buffer_captures
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check for the presence of custom tests and run them in priority. If there
+# are none, fall back to the default tests. Tests need only be run on the
+# master node.
+if [ -f ./test-custom.sh ]; then
+	. ./test-custom.sh
+else
+	. ./test-default.sh
+fi
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+DIFF_FILE=capture_differences.txt
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $TEST_MASTER $TEST_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Cover the case where capture file does not exist
+if [ $ERR_NUM == 2 ]; then
+	echo "Capture file does not exist"
+	echo "PASSED"
+	exit 0
+elif [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f8bed19..7555129 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7004,6 +7007,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..9e5dfc1
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,482 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X\n",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * These functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+buffer_capture_is_changed(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Write the new page content to the capture file after applying a
+ * mask to it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)))
+	{
+		/*
+		 * It happens that btree and gist have the same size of special
+		 * area.
+		 */
+		if (tail == GIST_PAGE_ID)
+			mask_gist_page(page);
+		else if (tail <= MAX_BT_CYCLE_ID)
+			mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == 8)
+	{
+		/*
+		 * XXX: Page detection for sequences can be improved.
+		 */
+		if (tail == SPGIST_PAGE_ID)
+			mask_spgist_page(page);
+		else if (*((uint32 *) (page + BLCKSZ - MAXALIGN(sizeof(uint32)))) == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Flush so that the complete line reaches the capture file */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append the content of a page to the list of buffers to be captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int	i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (buffer_capture_is_changed(img))
+			{
+				Page content = BufferGetPage(img->buffer);
+				RelFileNode	rnode;
+				ForkNumber	forknum;
+				BlockNumber	blkno;
+
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+	elog(LOG, "could not find image for buffer %u", buffer);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (buffer_capture_is_changed(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..b1dd4ce 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1708,6 +1711,10 @@ AtEOXact_Buffers(bool isCommit)
 {
 	CheckForBufferLeaks();
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1724,6 +1731,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2749,6 +2760,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * XXX: peek into the LWLock struct to see if we're holding it in
+		 * exclusive or shared mode. This is concurrency-safe: if we're holding
+		 * it in exclusive mode, no-one else can release it. If we're holding
+		 * it in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2757,6 +2782,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2768,6 +2798,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	res;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2775,7 +2806,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	res = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (res)
+		buffer_capture_remember(buffer);
+#endif
+
+	return res;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d23ac62..32762a6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1240,6 +1244,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..6beaa15 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,6 +21,9 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		ignore_checksum_failure = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..089e5a7
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also flushed to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are aimed to be used with the utility tool
+ * called buffer_capture_cmp available in contrib/ and can be used to
+ * compare how WAL is replayed between master and standby nodes, helping
+ * in spotting bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * done among nodes. Be aware that this has a large impact on I/O and that
+ * it is aimed only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN corresponds to the log sequence number to which the page image
+ * applies, then the content of the image is added as-is. This format
+ * is chosen to facilitate comparisons between each capture entry
+ * particularly in cases where LSN increases its digit number.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.0

#21

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Michael Paquier (#20)

1 attachment(s)

Re: WAL replay bugs

On Fri, Jun 27, 2014 at 2:57 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Here are rebased patches, there was a conflict with a recent commit in
contrib/pg_upgrade.

I am resending patch 2 as it contained a rebase conflict not correctly
resolved (Thanks Alvaro).

Regards,
--
Michael

Attachments:

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/x-diff; charset=US-ASCII)

From d4f0289ffcece54a78e51e8b707c41e994d549ee Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 27 Jun 2014 23:35:29 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 ++++--------------------------------
 src/test/shell/init_env.sh | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..2e1c61a 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$newsrc"/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port "$newsrc"
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..d37eb69
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses desirably
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo port $PGPORT apparently in use
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.0

#22

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#21)

Re: WAL replay bugs

Hello, I had a look at this patch.

Here are some comments about the README, Makefile and
buffer_capture_cmp of the second part for now. More comments
will follow later.

- contrib/buffer_capture_cmp/README

- 'contains' seems to be duplicated in the first paragraph.

- The second paragraph says that 'This script can use the node
number of the master node available as the first argument of
the script when it is run within the test suite.' But test.sh
does not seem to pass such a parameter.

- contrib/buffer_capture_cmp/Makefile

"make check" does nothing when BUFFER_CAPTURE is not defined, as
described in the Makefile itself. But I got trapped by that after
building the server with 'make CFLAGS="-DBUFFER_CAPTURE"' :( It
would be better if 'make check' printed some message when the
symbol is not defined.

- buffer_capture_cmp.c

This source generates an empty executable when BUFFER_CAPTURE is
not defined. The restriction seems to exist only so that
BUFFER_CAPTURE_FILE from bufcapt.h can be used. If so, changing
the parameters of the executable as described in the following
comment for main() would remove the need for the restriction.

- buffer_capture_cmp.c/main()

The parameters for this command are the parent directories of
each capture file. This is a bit inconvenient for standalone
use. For example, when I want to gather the capture files from
multiple servers and then compare them, I am forced to create a
separate directory for each capture file. If there is no
particular reason for this design, I suppose it would be more
convenient if the parameters were the names of the capture
files themselves.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#23

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#22)

3 attachment(s)

Re: WAL replay bugs

On Tue, Jul 1, 2014 at 7:25 PM, Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I had a look at this patch.

Thanks for your comments. Looking forward to seeing some more input.

- contrib/buffer_capture_cmp/README

- 'contains' seems to be duplicated in the first paragraph.
- The second paragraph says that 'This script can use the node
number of the master node available as the first argument of
the script when it is run within the test suite.' But test.sh
does not seem to pass such a parameter.

Yeah, right... This was a leftover from some previous hacking on this
feature. The paragraph was rather unclear, so I rewrote it, mentioning that
the custom script can use PGPORT to connect to the node where tests are run.

- contrib/buffer_capture_cmp/Makefile

"make check" does nothing when BUFFER_CAPTURE is not defined, as
described in the Makefile itself. But I got trapped by that after
building the server with 'make CFLAGS="-DBUFFER_CAPTURE"' :( It
would be better if 'make check' printed some message when the
symbol is not defined.

Sure, I added such a message in the makefile.

- buffer_capture_cmp.c

This source generates an empty executable when BUFFER_CAPTURE is
not defined. The restriction seems to exist only so that
BUFFER_CAPTURE_FILE from bufcapt.h can be used. If so, changing
the parameters of the executable as described in the following
comment for main() would remove the need for the restriction.

Done. The compilation of this utility is now independent of BUFFER_CAPTURE.
At the same time I made test.sh a bit smarter to have it grab the value of
BUFFER_CAPTURE_FILE directly from bufcapt.h.
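
As an illustration only (test.sh is a shell script, and this is not the code in the attached patch), the lookup of the BUFFER_CAPTURE_FILE value in bufcapt.h can be sketched in C as follows. The helper name parse_capture_file_define is hypothetical; it checks one header line for the definition and copies out the quoted value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: if "line" has the form
 *     #define BUFFER_CAPTURE_FILE "some_name"
 * copy some_name into "out" and return true. Return false for any other
 * line, or if the value does not fit in "out".
 */
static bool
parse_capture_file_define(const char *line, char *out, size_t outlen)
{
	char		name[64];
	char		value[256];

	assert(line != NULL && out != NULL && outlen > 0);

	/* Expect: #define <name> "<value>" */
	if (sscanf(line, " #define %63s \"%255[^\"]\"", name, value) != 2)
		return false;

	/* Only the capture file definition is of interest */
	if (strcmp(name, "BUFFER_CAPTURE_FILE") != 0)
		return false;

	if (strlen(value) >= outlen)
		return false;

	strcpy(out, value);
	return true;
}
```

A caller would feed it each line of src/include/storage/bufcapt.h until it returns true. With the header shown in the patch, the extracted value would be buffer_captures.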

- buffer_capture_cmp.c/main()

The parameters for this command are the parent directories of
each capture file. This is a bit inconvenient for standalone
use. For example, when I want to gather the capture files from
multiple servers and then compare them, I am forced to create a
separate directory for each capture file. If there is no
particular reason for this design, I suppose it would be more
convenient if the parameters were the names of the capture
files themselves.

Fixed. I changed the utility back to taking file names directly as
arguments instead of data folders.
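
To make the file-name interface concrete, here is a minimal sketch, assuming the "LSN: %08X/%08X page: ..." line format described in the patch. The helper names parse_capture_lsn and first_capture_lsn are hypothetical and do not appear in buffer_capture_cmp.c; they only show how a capture file given by name can be opened and the LSN of its first entry decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the "LSN: %08X/%08X" prefix of one capture
 * line into a 64-bit LSN. Returns false if the prefix is malformed.
 */
static bool
parse_capture_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	assert(line != NULL && lsn != NULL);

	if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
		return false;

	/* High half is the log id, low half the offset within it */
	*lsn = ((uint64_t) hi << 32) | (uint64_t) lo;
	return true;
}

/*
 * Hypothetical helper: read the LSN of the first entry of the capture
 * file at "path". Returns false if the file cannot be opened or if its
 * first line does not start with a valid LSN prefix.
 */
static bool
first_capture_lsn(const char *path, uint64_t *lsn)
{
	char		prefix[64];	/* enough for the LSN prefix of a line */
	FILE	   *fp = fopen(path, "rb");
	bool		ok;

	if (fp == NULL)
		return false;

	ok = fgets(prefix, sizeof(prefix), fp) != NULL &&
		parse_capture_lsn(prefix, lsn);
	fclose(fp);
	return ok;
}
```

With file names as arguments, capture files gathered from several servers can sit side by side in one directory and be passed to such a function directly.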

Updated patches addressing those comments are attached.
Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-to-sequence.h.patch (text/x-diff; charset=US-ASCII)

From 71f5ddbcd4f691ffdb5237ba9ced51152cd0c43e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 16 Jun 2014 10:38:57 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC to sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 5 -----
 src/include/commands/sequence.h | 4 ++++
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..2134eae 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -45,11 +45,6 @@
  */
 #define SEQ_LOG_VALS	32
 
-/*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
 typedef struct sequence_magic
 {
 	uint32		magic;
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..3a69580 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,10 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * "special area" identifier of a sequence's buffer page
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.0

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/x-diff; charset=US-ASCII)

From 36e9713cd7ab1c2331e1845b71e2fde4104322f4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 27 Jun 2014 23:35:29 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 ++++--------------------------------
 src/test/shell/init_env.sh | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..2e1c61a 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$newsrc"/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port "$newsrc"
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..d37eb69
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses desirably
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo port $PGPORT apparently in use
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.0

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/x-diff; charset=US-ASCII)

From b396680f3b37285aa72bb5fccc8bbfeb47811306 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 2 Jul 2014 14:20:08 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines that
can be used to check for consistency at page level when replaying WAL
files among several nodes of a cluster (generally master and standby
node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits for example) and then each
buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a CFLAGS symbol
called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder... There is a default set of tests saved in test-default.sh
but user is free to set up custom tests by creating a file called
test-custom.sh that can be kicked by the test facility if this file
is present instead of the defaults.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  35 ++
 contrib/buffer_capture_cmp/README               |  33 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 329 ++++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  14 +
 contrib/buffer_capture_cmp/test.sh              | 161 ++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 482 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   3 +
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  65 ++++
 15 files changed, 1204 insertions(+), 2 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..83f2725
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,35 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS=--use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+else
+check:
+	echo "BUFFER_CAPTURE is not defined in CFLAGS, so no tests to run"
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..8a0a154
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,33 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that can be run
+by default with the following command:
+
+    make check
+
+The regression scripts contain a hook that can be used as an entry point
+to run some custom tests using this facility. Simply create in this folder
+a file called test-custom.sh and execute all the commands necessary for the
+tests. This custom script should use PGPORT as port to connect to the
+server where tests are run.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..dcb1b3f
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,329 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know that it compares lower than
+			 * the last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Fetch LSN from the "LSN: %08X/%08X" prefix of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of records */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a && j < strlen_b; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's capture file> <standby's capture file>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		fprintf(stderr, "Check if server binaries are built with symbol BUFFER_CAPTURE\n");
+		exit(2);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (lsn_a < lsn_b || nlines_b == 0)
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..5bec503
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck > /dev/null 2>&1
+# cd $ROOT_DIR
+
+# Create a simple table
+psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..d44cd56
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,161 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_captures on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Fetch file name containing buffer captures
+CAPTURE_FILE_NAME=`grep '#define BUFFER_CAPTURE_FILE' "$PG_ROOT_DIR"/src/include/storage/bufcapt.h | awk '{print $3}' | sed 's/\"//g'`
+CAPTURE_FILE_MASTER=$TEST_MASTER/$CAPTURE_FILE_NAME
+CAPTURE_FILE_STANDBY=$TEST_STANDBY/$CAPTURE_FILE_NAME
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check the presence of custom tests and kick them in priority. If not,
+# fallback to the default tests. Tests need only to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	. ./test-custom.sh
+else
+	. ./test-default.sh
+fi
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+DIFF_FILE=capture_differences.txt
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Cover the case where capture file does not exist
+if [ $ERR_NUM == 2 ]; then
+	echo "Capture file does not exist"
+	echo "PASSED"
+	exit 0
+elif [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6861ae0..d43406c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7010,6 +7013,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..9e5dfc1
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,482 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * These functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+buffer_capture_is_changed(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Write the new content of the given page to the capture file after
+ * applying a mask to it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)))
+	{
+		/*
+		 * It happens that btree and gist have the same size of special
+		 * area.
+		 */
+		if (tail == GIST_PAGE_ID)
+			mask_gist_page(page);
+		else if (tail <= MAX_BT_CYCLE_ID)
+			mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == 8)
+	{
+		/*
+		 * XXX: Page detection for sequences can be improved.
+		 */
+		if (tail == SPGIST_PAGE_ID)
+			mask_spgist_page(page);
+		else if (*((uint32 *) (page + BLCKSZ - MAXALIGN(sizeof(uint32)))) == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int		j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Make sure the whole entry reaches the file */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int	i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (buffer_capture_is_changed(img))
+			{
+				Page content = BufferGetPage(img->buffer);
+				RelFileNode	rnode;
+				ForkNumber	forknum;
+				BlockNumber	blkno;
+
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+	elog(LOG, "could not find image for buffer %u", buffer);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (buffer_capture_is_changed(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..b1dd4ce 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1708,6 +1711,10 @@ AtEOXact_Buffers(bool isCommit)
 {
 	CheckForBufferLeaks();
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1724,6 +1731,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2749,6 +2760,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * XXX: peek into the LWLock struct to see if we're holding it in
+		 * exclusive or shared mode. This is concurrency-safe: if we're holding
+		 * it in exclusive mode, no-one else can release it. If we're holding
+		 * it in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2757,6 +2782,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2768,6 +2798,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	res;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2775,7 +2806,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	res = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (res)
+		buffer_capture_remember(buffer);
+#endif
+
+	return res;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a62af27..4671a55 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1272,6 +1276,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..6beaa15 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,6 +21,9 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		ignore_checksum_failure = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..089e5a7
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also written to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are meant to be used with the utility
+ * buffer_capture_cmp, available in contrib/, to compare how WAL is
+ * replayed between master and standby nodes, helping to spot bugs in
+ * this area. As each page is written in hexadecimal format, a single
+ * page writes BLCKSZ * 2 bytes to the capture file. Hexadecimal format
+ * makes it easier to spot differences between captures taken on
+ * different nodes. Be aware that this has a large impact on I/O and
+ * that it is intended only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN is the log sequence number to which the page image applies,
+ * then the content of the image is added as-is. This fixed-width format
+ * is chosen to facilitate comparisons between capture entries,
+ * particularly in cases where the LSN gains a digit.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.0
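
As an illustration of the capture-entry format described in bufcapt.h above (`LSN: %08X/%08X page: ...`), here is a minimal sketch of how one line's LSN prefix could be parsed with the conversion result checked. The helper name `parse_capture_lsn` is hypothetical and not part of the patch; readrecord() in the patch does the same conversion inline without checking the sscanf() result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the "LSN: XXXXXXXX/XXXXXXXX" prefix of one capture-file line.
 * Returns 1 and stores the 64-bit LSN on success, 0 if the prefix is
 * malformed. Hypothetical helper, for illustration only.
 */
int
parse_capture_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	assert(line != NULL && lsn != NULL);

	/* sscanf returns the number of converted fields; both are required */
	if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
		return 0;

	/* High 32 bits are the log ID, low 32 bits the record offset */
	*lsn = ((uint64_t) hi << 32) | (uint64_t) lo;
	return 1;
}
```

A caller would treat a return value of 0 as a corrupted capture file instead of comparing against whatever was left in the output variable.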

#24

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#23)

Re: WAL replay bugs

Hello,

Thanks for your comments. Looking forward to seeing some more input.

Thank you. bufcapt.c was a poser.

bufcapt.c: 326 memcpy(&tail, &page[BLCKSZ - 2], 2);

This seems puzzling.. Isn't "*(uint16*)(&page[BLCKSZ - 2])" applicable?
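
For reference, a minimal sketch of the memcpy form being discussed. On common platforms both forms read the same two bytes; the memcpy variant simply makes no assumption about the alignment of the buffer and avoids reading a char array through a uint16 lvalue. `EXAMPLE_BLCKSZ` and `example_page_tail` are names made up for this sketch, with the block size assumed to be the default 8192.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed block size for the example (PostgreSQL's default BLCKSZ) */
#define EXAMPLE_BLCKSZ 8192

/*
 * Return the last two bytes of a page image as a host-order uint16.
 * Hypothetical helper, for illustration only.
 */
uint16_t
example_page_tail(const char *page)
{
	uint16_t	tail;

	assert(page != NULL);

	/* Copy byte-wise, so "page" may have any alignment */
	memcpy(&tail, page + EXAMPLE_BLCKSZ - sizeof(tail), sizeof(tail));
	return tail;
}
```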

bufcapt.c: 331 else if (PageGetSpecial....

Generally speaking, the code to identify the type of page is too
heuristic and seems fragile.

Perhaps this shows that there is no tidy or generalized way to tell
what a page is used for. Many of the modules which have their
own page format have a magic value, and the values seem to be
selected carefully. But there is no one-for-all method to retrieve it.

Each type of page can be confirmed in the following way *if*
its type is *hinted* beforehand, except for gin.

btree : 32-bit magic at pd->opaque
gin : No magic
gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_id
spgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id
hash : 16-bit magic at ((HashPageOpaque*)pd->opaque)->hasho_page_id
sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page.

# Is it comprehensive? and correct?

The majority use a 16-bit magic at the TAIL of the opaque struct. If
we can unify them into, say, a 16-bit magic at
*(uint16*)(pd->opaque), the sizeof() labyrinth becomes stable and
simple, and other tools which need to identify the type of pages
will be happy. Do you think we can do that?
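
To make the idea of a single 16-bit identifier concrete, here is a rough sketch of a classifier that looks only at the last two bytes of a page. It assumes the caller already knows the page has an index-style special area (a heap page has none, so its last bytes are tuple data). The `EX_*` constants mirror the values of MAX_BT_CYCLE_ID, HASHO_PAGE_ID, GIST_PAGE_ID and SPGIST_PAGE_ID as I recall them from the PostgreSQL headers and should be checked against the source tree; the enum and function names are invented for the example. GIN and sequence pages are deliberately absent, since they carry no such identifier today.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_BLCKSZ		8192	/* assumed block size */

/* Assumed copies of the per-AM identifiers; verify against the headers */
#define EX_MAX_BT_CYCLE_ID	0xFF7F
#define EX_HASHO_PAGE_ID	0xFF80
#define EX_GIST_PAGE_ID		0xFF81
#define EX_SPGIST_PAGE_ID	0xFF82

typedef enum
{
	EX_PAGE_UNKNOWN,
	EX_PAGE_BTREE,
	EX_PAGE_HASH,
	EX_PAGE_GIST,
	EX_PAGE_SPGIST
} ExamplePageKind;

/*
 * Classify a page from the 16-bit value stored in its last two bytes.
 * Hypothetical helper, for illustration only.
 */
ExamplePageKind
example_kind_from_tail(const char *page)
{
	uint16_t	tail;

	assert(page != NULL);
	memcpy(&tail, page + EXAMPLE_BLCKSZ - sizeof(tail), sizeof(tail));

	switch (tail)
	{
		case EX_HASHO_PAGE_ID:
			return EX_PAGE_HASH;
		case EX_GIST_PAGE_ID:
			return EX_PAGE_GIST;
		case EX_SPGIST_PAGE_ID:
			return EX_PAGE_SPGIST;
		default:
			break;
	}

	/* btree keeps a vacuum cycle ID in this slot, bounded by the maximum */
	if (tail <= EX_MAX_BT_CYCLE_ID)
		return EX_PAGE_BTREE;

	return EX_PAGE_UNKNOWN;
}
```

With a scheme like this the special-area size would no longer be needed to tell the access methods apart, which is the simplification being asked about.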

# Sorry, time's up for today.

- contrib/buffer_capture_cmp/README

Yeah right... This was a remnant of some previous hacking on this feature.
The paragraph was rather unclear so I rewrote it, mentioning that the custom
script can use PGPORT to connect to the node where tests can be run.

- contrib/buffer_capture_cmp/Makefile

"make check" does nothing when BUFFER_CAPTURE is not defined, as

...

Sure, I added such a message in the makefile.

- buffer_capture_cmp.c

This source generates void executable when BUFFER_CAPTURE is

Done. The compilation of this utility is now independent of BUFFER_CAPTURE.
At the same time I made test.sh a bit smarter to have it grab the value of
BUFFER_CAPTURE_FILE directly from bufcapt.h.

- buffer_capture_cmp.c/main()

The parameters for this command are the parent directories for

...

Fixed. I changed the utility back to taking file names directly, instead
of data folders, as arguments.

Updated patches addressing those comments are attached.
Regards,

Thank you I'll look into them later.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#24)

Re: WAL replay bugs

On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

bufcapt.c: 326 memcpy(&tail, &page[BLCKSZ - 2], 2);

This seems puzzling.. Isn't "*(uint16*)(&page[BLCKSZ - 2])" applicable?

Well yes it is.

Perhaps this shows that there is no tidy or generalized way to tell
what a page is used for. Many of the modules which have their
own page format have a magic value, and the values seem to be
selected carefully. But there is no one-for-all method to retrieve it.

You have a point here.

Each type of page can be confirmed in the following way *if*
its type is *hinted* beforehand, except for gin.

btree : 32-bit magic at pd->opaque
gin : No magic
gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_id
spgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id
hash : 16-bit magic at ((HashPageOpaque*)pd->opaque)->hasho_page_id
sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page.

# Is it comprehensive? and correct?

Sequence pages use the last 4 bytes. Have a look at sequence_magic in
sequence.c.
For btree pages we can use the last 2 bytes and a check on MAX_BT_CYCLE_ID.
For gin, I'll investigate whether it is possible to add an identifier like
GIN_PAGE_ID; it would make the page analysis more consistent with the
others. I am not sure what the 8 bytes allocated for the special
area are currently used for, though.
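
As a rough sketch of the tail-based identification discussed here, the helper below guesses the access method from the last 2 bytes of a page. The ID values are written from memory of hash.h, gist.h and spgist_private.h and should be checked against the headers; guess_page_kind and the enum are hypothetical names, and BLCKSZ is assumed to be the default 8192. It only makes sense when the caller already has a hint that the page belongs to an index, since a heap page has ordinary tuple data in that position.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ          8192    /* assumption: default block size */

/* Assumed values, to be verified against the access method headers */
#define HASHO_PAGE_ID   0xFF80
#define GIST_PAGE_ID    0xFF81
#define SPGIST_PAGE_ID  0xFF82

typedef enum
{
    PAGE_KIND_UNKNOWN,          /* btree, gin, heap, sequence need other checks */
    PAGE_KIND_HASH,
    PAGE_KIND_GIST,
    PAGE_KIND_SPGIST
} PageKindGuess;

/* Hypothetical helper: classify a page by its trailing 16-bit page ID. */
static PageKindGuess
guess_page_kind(const char *page)
{
    uint16_t    tail;

    assert(page != NULL);
    memcpy(&tail, &page[BLCKSZ - 2], sizeof(tail));

    switch (tail)
    {
        case HASHO_PAGE_ID:
            return PAGE_KIND_HASH;
        case GIST_PAGE_ID:
            return PAGE_KIND_GIST;
        case SPGIST_PAGE_ID:
            return PAGE_KIND_SPGIST;
        default:
            return PAGE_KIND_UNKNOWN;
    }
}
```

Anything that falls through to PAGE_KIND_UNKNOWN still needs the special-space size and contents to be examined, which is where gin remains awkward.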

The majority use a 16-bit magic at the TAIL of the opaque struct. If
we can unify them into, say, a 16-bit magic at
*(uint16*)(pd->opaque), the sizeof() labyrinth becomes stable and
simple, and other tools that need to identify the type of a page
will be happy. Do you think we can do that?

Yes, I think so. I'll raise a different thread though, as this is a
different problem than what this patch is targeting. I would even
imagine a macro in bufpage.c able to handle that well.
Regards,
--
Michael


#26

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Michael Paquier (#25)

Re: WAL replay bugs

Michael Paquier <michael.paquier@gmail.com> writes:

On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Perhaps this shows that there is no tidy or general way to tell
what a page is used for. Many of the modules that have their
own page format have a magic value, and the values seem to be
chosen carefully, but there is no one-for-all method to retrieve it.

You have a point here.

Yeah, it's a bit messy, but I believe it's currently always possible to
tell which access method a PG page belongs to. Look at pg_filedump.
The last couple times we added index access methods, we took pains to
make sure pg_filedump could figure out what their pages were. (IIRC,
it's a combination of the special-space size and contents, but I'm too
tired to go check the details right now.)

For gin, I'll investigate whether it is possible to add an identifier like
GIN_PAGE_ID; it would make the page analysis more consistent with the
others. I am not sure what the 8 bytes allocated for the special
area are currently used for, though.

There is exactly zero chance that anyone will accept an on-disk format
change just to make this prettier.

regards, tom lane


#27

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Tom Lane (#26)

Re: WAL replay bugs

On Thu, Jul 3, 2014 at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Perhaps this shows that there is no tidy or general way to tell
what a page is used for. Many of the modules that have their
own page format have a magic value, and the values seem to be
chosen carefully, but there is no one-for-all method to retrieve it.

You have a point here.

Yeah, it's a bit messy, but I believe it's currently always possible to
tell which access method a PG page belongs to. Look at pg_filedump.
The last couple times we added index access methods, we took pains to
make sure pg_filedump could figure out what their pages were. (IIRC,
it's a combination of the special-space size and contents, but I'm too
tired to go check the details right now.)

Yes, that's what the current code does btw, in this *messy* way.

For gin, I'll investigate whether it is possible to add an identifier like
GIN_PAGE_ID; it would make the page analysis more consistent with the
others. I am not sure what the 8 bytes allocated for the special
area are currently used for, though.

There is exactly zero chance that anyone will accept an on-disk format
change just to make this prettier.

Yeah thought so.
--
Michael


#28

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#24)

Re: WAL replay bugs

Hello, here are additional comments on the other parts.

I haven't seen any significant defects in the code so far.

===== bufcapt.c:

- buffer_capture_remember() or so.

Pages of unlogged tables are skipped by taking advantage of the
fact that the LSN of such pages stays 0/0. I'd like to see
a comment mentioning that in, say, buffer_capture_is_changed,
buffer_capture_forget, or somewhere similar.

- buffer_capture_forget()

Although this error is unlikely to occur, the buffer id in the error
message "could not find image..." seems to carry no useful
information. The LSN, or the quadruplet of LSN, rnode, forknum and
blockno, would be more informative.

- buffer_capture_is_changed()

The name of the function seems to be misleading. What this
function does is compare the LSNs of the stored page image and
the current page. lsn_is_changed(BufferImage) or something like
that would be clearer.

====== bufmgr.c:

- ConditionalLockBuffer()

Sorry for a trivial comment, but the variable name 'res' conceals
the meaning. "acquired" seems preferable, doesn't it?

- LockBuffer()

There is an 'XXX' comment. The discussion written there seems
right to me. If you feel that peeking in there is bad
behavior, adding a macro to lwlock.h would help :p

lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

# I don't see this usable so much..

bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock))

If there isn't any particular concern, 'XXX:' should be removed.

===== bufpage.c:

- #include "storage/bufcapt.h"

The include seems to be needless.

===== bufcapt.h:

- header comment

The file name is misspelled as 'bufcaptr.h'.
Is the UC copyright notice needed? (The same applies to the other files.)

- The includes in this header, except for buf.h, do not seem to be
necessary.

===== init_env.sh:

- pg_get_test_port()

It determines the server port using PG_VERSION_NUM, but is that
necessary? In addition, the theoretical maximum initial
PGPORT seems to be 65535, but this script searches for a free port
up to 16 numbers above the start, which would go
beyond the edge.

- pg_get_test_port()

It stops with an error after 16 failures, but the message
complains only about the final attempt. If you want to mention
the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT
..' or 'All 16 ports attempted failed' or something similar.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#29

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#28)

3 attachment(s)

Re: WAL replay bugs

OK, I have been working more on this patch, improving on-the-fly the
following things on top of what Horiguchi-san has reported:
- Moved sequence page opaque data to sequence.h, renaming it at the same time.
- Improvement of page type identification, particularly for sequences
using a correct opaque data structure. For gin the process is not that
cool, but I guess that there is nothing much to do as it has no proper
page identifier :(

On Thu, Jul 3, 2014 at 7:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

===== bufcapt.c:

- buffer_capture_remember() or so.

Pages of unlogged tables are skipped by taking advantage of the
fact that the LSN of such pages stays 0/0. I'd like to see
a comment mentioning that in, say, buffer_capture_is_changed,
buffer_capture_forget, or somewhere similar.

Yes, it is worth mentioning and a comment in bufcapt.h seems enough.

- buffer_capture_forget()

Although this error is unlikely to occur, the buffer id in the error
message "could not find image..." seems to carry no useful
information. The LSN, or the quadruplet of LSN, rnode, forknum and
blockno, would be more informative.

Yes, this seems informative.

- buffer_capture_is_changed()

The name of the function seems to be misleading. What this
function does is compare the LSNs of the stored page image and
the current page. lsn_is_changed(BufferImage) or something like
that would be clearer.

Hm, yes. This name looks better, and that is fine as the function remains static within bufcapt.c.

====== bufmgr.c:

- ConditionalLockBuffer()

Sorry for a trivial comment, but the variable name 'res' conceals
the meaning. "acquired" seems preferable, doesn't it?

Fixed.

- LockBuffer()

There is an 'XXX' comment. The discussion written there seems
right to me. If you feel that peeking in there is bad
behavior, adding a macro to lwlock.h would help :p

lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

I don't think that there is much to gain with such macros, as for now
LWLock->exclusive is only used in the code this patch introduces.

# I don't see this usable so much..

bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock))

If there isn't any particular concern, 'XXX:' should be removed.

Well yes.

===== bufpage.c:

- #include "storage/bufcapt.h"

The include seems to be needless.

Yep.

===== bufcapt.h:

- header comment

The file name is misspelled as 'bufcaptr.h'.

Nicely spotted.

- The includes in this header except for buf.h seem not to be
necessary.

Yep.

===== init_env.sh:

- pg_get_test_port()
It determines the server port using PG_VERSION_NUM, but is that
necessary? In addition, the theoretical maximum initial
PGPORT seems to be 65535, but this script searches for a free port
up to 16 numbers above the start, which would go
beyond the edge.

Hm, no. As of now, there is still some margin:
PG_VERSION_NUM = 90500
PG_VERSION_NUM % 16384 + 49152 = 57732
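
To make the arithmetic concrete, here is a small sketch of the same computation in C. The helper names and constants are hypothetical; the formula and the 16 attempts are the ones used by the script. The start port stays within 49152..65535 for any version number, so the search can only run past 65535 when the remainder is within 15 of 16383.

```c
#include <assert.h>
#include <stdbool.h>

#define PORT_BASE       49152   /* start of the dynamic port range */
#define PORT_SPAN       16384   /* modulus used by the test scripts */
#define MAX_TCP_PORT    65535

/* Hypothetical helper: first port tried for a given PG_VERSION_NUM. */
static int
test_port_start(int pg_version_num)
{
    assert(pg_version_num >= 0);
    return pg_version_num % PORT_SPAN + PORT_BASE;
}

/* Hypothetical helper: do 'attempts' consecutive ports stay valid? */
static bool
port_range_fits(int start, int attempts)
{
    return start + attempts - 1 <= MAX_TCP_PORT;
}
```

With PG_VERSION_NUM = 90500 the start is 57732 and the last port tried is 57747, so there is indeed margin for now.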

- pg_get_test_port()

It stops with an error after 16 failures, but the message
complains only about the final attempt. If you want to mention
the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT
..' or 'All 16 ports attempted failed' or something similar.

Hm. You could make the same complaint about pg_upgrade now.
But I guess that it doesn't hurt to report back to the caller
the range of ports already in use, so I changed it this way.

Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-and-sequence-page-opaque-data-to-sequ.patch (text/plain; charset=US-ASCII)

From 2295a5ff426d9f1ddb3ab2b2f08dc9b52f6309b0 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 13:34:47 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC and sequence page opaque data to
 sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 34 ++++++++++++----------------------
 src/include/commands/sequence.h | 13 +++++++++++++
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..802aac7 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -46,16 +46,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -306,7 +296,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -316,9 +306,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1066,18 +1056,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1541,7 +1531,7 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	/* Backup blocks are not used in seq records */
 	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
@@ -1564,9 +1554,9 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = record->xl_len - sizeof(xl_seq_rec);
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..2455878 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is for the convenience of being able to identify if a page
+ * is being used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.0
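
As an illustration of what moving these definitions to the header allows, here is a minimal standalone sketch of a sequence-page check. The struct and SEQ_MAGIC mirror the patch above; page_is_sequence and its special_offset parameter are hypothetical, standing in for what PageGetSpecialPointer() would provide in the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the definitions added to sequence.h by the patch */
typedef struct SequencePageOpaqueData
{
    uint32_t    seq_page_id;
} SequencePageOpaqueData;

#define SEQ_MAGIC       0x1717

/*
 * Hypothetical helper: check whether the special area starting at
 * special_offset carries the sequence magic number.
 */
static bool
page_is_sequence(const unsigned char *page, size_t page_size,
                 size_t special_offset)
{
    SequencePageOpaqueData sm;

    assert(page != NULL);

    /* The special area must be large enough to hold the opaque data */
    if (special_offset > page_size ||
        page_size - special_offset < sizeof(sm))
        return false;

    memcpy(&sm, page + special_offset, sizeof(sm));
    return sm.seq_page_id == SEQ_MAGIC;
}
```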

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/plain; charset=US-ASCII)

From e5b19bba44d362feab1e088f79ba14c754dec6e1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 14:12:11 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 +++--------------------------------
 src/test/shell/init_env.sh | 61 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..2e1c61a 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$newsrc"/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port "$newsrc"
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..e4d6cdb
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses desirably
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	PGPORT_START=$PGPORT
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo "Ports from $PGPORT_START to $PGPORT are apparently in use"
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.0

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain; charset=US-ASCII)

From 7236ba43f9cd995aad5f50c3709cde5d7756c410 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 15:28:00 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines that
can be used to check for consistency at page level when replaying WAL
files among several nodes of a cluster (generally master and standby
node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits for example) and then each
buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.
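
A small sketch of the line format described above, for illustration only: it writes and parses the "LSN: %08X/%08X page: " prefix. The function names are hypothetical and not taken from bufcapt.c. Because the LSN is printed as fixed-width, zero-padded, upper-case hexadecimal, a plain strcmp() on two lines orders them by LSN, which is what lets the comparison tool sort lines as strings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the line prefix for a given LSN. */
static int
format_capture_prefix(char *out, size_t outsz, uint64_t lsn)
{
    assert(out != NULL);
    return snprintf(out, outsz, "LSN: %08X/%08X page: ",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}

/* Hypothetical helper: read the LSN back; returns 0 on success, -1 if not. */
static int
parse_capture_lsn(const char *line, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/* Lines compare in LSN order because the hex fields are fixed width. */
static int
capture_line_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}
```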

The footprint on core code is minimal and is controlled by a CFLAGS
symbol called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder... There is a default set of tests saved in test-default.sh
but user is free to set up custom tests by creating a file called
test-custom.sh that can be kicked by the test facility if this file
is present instead of the defaults.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  35 ++
 contrib/buffer_capture_cmp/README               |  33 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 327 +++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  14 +
 contrib/buffer_capture_cmp/test.sh              | 170 ++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 504 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   1 -
 src/include/commands/sequence.h                 |   2 +-
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  66 ++++
 16 files changed, 1232 insertions(+), 4 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..f4d2d8d
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,35 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS = --use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+else
+check:
+	echo "BUFFER_CAPTURE is not defined in CFLAGS, so no tests to run"
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..8a0a154
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,33 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that can be run
+by default with the following command:
+
+    make check
+
+The regression scripts contain a hook that can be used as an entry point
+to run some custom tests using this facility. Simply create in this folder
+a file called test-custom.sh and execute all the commands necessary for the
+tests. This custom script should use PGPORT as port to connect to the
+server where tests are run.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..edf9bed
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,327 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know that it compares lower than
+			 * the last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Parse the LSN from the prefix of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of records */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a && j < strlen_b; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's capture file> <standby's capture file>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		exit(1);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		exit(1);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (nlines_b == 0 || (nlines_a > 0 && lsn_a < lsn_b))
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (nlines_a == 0 || lsn_a > lsn_b)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..5bec503
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck 2>&1 > /dev/null
+# cd $ROOT_DIR
+
+# Create a simple table
+psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..89740bb
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,170 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_captures on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Fetch file name containing buffer captures
+CAPTURE_FILE_NAME=`grep '#define BUFFER_CAPTURE_FILE' "$PG_ROOT_DIR"/src/include/storage/bufcapt.h | awk '{print $3}' | sed 's/\"//g'`
+CAPTURE_FILE_MASTER=$TEST_MASTER/$CAPTURE_FILE_NAME
+CAPTURE_FILE_STANDBY=$TEST_STANDBY/$CAPTURE_FILE_NAME
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check for the presence of custom tests and run them in priority. Otherwise,
+# fall back to the default tests. Tests only need to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	. ./test-custom.sh
+else
+	. ./test-default.sh
+fi
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+DIFF_FILE=capture_differences.txt
+
+# Check if the capture files exist. If not, build may have not been
+# done with BUFFER_CAPTURE enabled.
+if [ ! -f $CAPTURE_FILE_MASTER ]; then
+	echo "Capture file $CAPTURE_FILE_MASTER is missing on master"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS"
+	exit 0
+fi
+if [ ! -f $CAPTURE_FILE_STANDBY ]; then
+	echo "Capture file $CAPTURE_FILE_STANDBY is missing on standby"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS"
+	exit 0
+fi
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Leave on error
+if [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6861ae0..d43406c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7010,6 +7013,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..57c8c6e
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X\n",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records don't currently set the flag, even though it is set in the
+	 * master, so we must silence failures that that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER, BLCKSZ -
+			   SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * Those functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+lsn_is_updated(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Flush to file the new content page present here after applying a
+ * mask on it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		/* Case of a normal relation, it has an empty special area */
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+			 tail == GIST_PAGE_ID)
+	{
+		/* Gist page */
+		mask_gist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+			 tail <= MAX_BT_CYCLE_ID)
+	{
+		/* btree page */
+		mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+			 tail == SPGIST_PAGE_ID)
+	{
+		/* SpGist page */
+		mask_spgist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+	{
+		/*
+		 * The page found here is used either for a Gin index or a sequence.
+		 * Gin index pages do not have a proper identifier, so check if the page
+		 * is used by a sequence or not. If it is not the case, this page is used
+		 * by a gin index. It is still possible that a gin page holds, in this
+		 * area, exactly the same value as SEQ_MAGIC, but this is unlikely to happen.
+		 */
+		if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+	else
+	{
+		/* Should not come here */
+		Assert(0);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Make sure the entry reaches the capture file before releasing the lock */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int			i;
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	Page		content;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (lsn_is_updated(img))
+			{
+				content = BufferGetPage(img->buffer);
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+
+	/* Gather some information about this buffer image not found */
+	content = BufferGetPage(buffer);
+	BufferGetTag(buffer, &rnode, &forknum, &blkno);
+	elog(LOG, "could not find image for buffer %u: LSN %X/%08X rnode %u, "
+			  "forknum %u, blkno %u",
+		 buffer, ((PageHeader) content)->pd_lsn.xlogid,
+		 ((PageHeader) content)->pd_lsn.xrecoff,
+		 rnode.relNode, forknum, blkno);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed and
+ * update them if it is the case.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (lsn_is_updated(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..4624a99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1708,6 +1711,10 @@ AtEOXact_Buffers(bool isCommit)
 {
 	CheckForBufferLeaks();
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1724,6 +1731,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2749,6 +2760,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * Peek into the LWLock struct to see if we're holding it in exclusive
+		 * or shared mode. This is concurrency-safe: if we're holding it in
+		 * exclusive mode, no-one else can release it. If we're holding it
+		 * in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2757,6 +2782,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2768,6 +2798,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	acquired;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2775,7 +2806,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	acquired = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (acquired)
+		buffer_capture_remember(buffer);
+#endif
+
+	return acquired;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a62af27..4671a55 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1272,6 +1276,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..8b3b83c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,7 +21,6 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
-
 /* GUC variable */
 bool		ignore_checksum_failure = false;
 
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 2455878..cbd6780 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -27,7 +27,7 @@ typedef struct SequencePageOpaqueData
 } SequencePageOpaqueData;
 
 /*
- * This page ID is for the conveniende to be able to identify if a page
+ * This page ID is for the convenience to be able to identify if a page
  * is being used by a sequence.
  */
 #define SEQ_MAGIC		0x1717
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..012e5eb
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are as well flushed to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are aimed to be used with the utility tool
+ * called buffer_capture_cmp available in contrib/ and can be used to
+ * compare how WAL is replayed between master and standby nodes, helping
+ * in spotting bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * done among nodes. Be aware that this has a large impact on I/O and that
+ * it is aimed only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN is the log sequence number to which the page image
+ * applies, then the content of the image is added as-is. This format
+ * is chosen to facilitate comparisons between each capture entry
+ * particularly in cases where LSN increases its digit number. As unlogged
+ * relations do not have LSN numbers saved, their buffer modifications are
+ * not captured by this facility.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.0
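
For readers following the capture format described in bufcapt.h above, the fixed-width LSN prefix can be sketched as below. This is only a minimal illustration of the "LSN: %08X/%08X" convention, not code from the patch; parse_capture_lsn and format_capture_lsn are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the "LSN: XXXXXXXX/XXXXXXXX" prefix of one capture line into a
 * 64-bit value. Returns 1 on success, 0 if the prefix is malformed.
 * (Hypothetical helper, not part of the patch.)
 */
static int
parse_capture_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	assert(line != NULL && lsn != NULL);
	if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
		return 0;
	*lsn = ((uint64_t) hi << 32) | (uint64_t) lo;
	return 1;
}

/*
 * Format an LSN with zero-padded, fixed-width halves, so that entries keep
 * the same width when the LSN gains a digit.
 * (Hypothetical helper, not part of the patch.)
 */
static void
format_capture_lsn(uint64_t lsn, char *out, size_t outlen)
{
	snprintf(out, outlen, "LSN: %08X/%08X",
			 (unsigned int) (lsn >> 32),
			 (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

Because both halves are zero-padded to eight hex digits, two capture lines for the same LSN start with identical text on master and standby, which is what lets the comparison tool group lines by LSN.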

#30

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#29)

Re: WAL replay bugs

Hello, thank you for considering my comments, including somewhat
impractical ones.

I'll have a look at the latest patch soon.

At Fri, 4 Jul 2014 15:29:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT_fs_3jLNHYWC6nzej4sBL6DGsLFVCg0JBUkgjeP9Tfw@mail.gmail.com>

OK, I have been working more on this patch, improving on-the-fly the
following things on top of what Horiguchi-san has reported:
- Moved sequence page opaque data to sequence.h, renaming it at the same time.
- Improvement of page type identification, particularly for sequences
using a correct opaque data structure. For gin the process is not that
cool, but I guess that there is nothing much to do as it has no proper
page identifier :(

Yeah, there seems to be no choice other than that.

On Thu, Jul 3, 2014 at 7:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

===== bufcapt.c:

- buffer_capture_remember() or so.

...

Yes, it is worth mentioning and a comment in bufcapt.h seems enough.

- buffer_capture_forget()

...

Yes, this seems informative.

- buffer_capture_is_changed()

...

Hm, yes. This name looks fine, as it remains static within bufcapt.c.

====== bufmgr.c:

- ConditionalLockBuffer()

...

Fixed.

- LockBuffer()

...

lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

I don't think that there is much to gain with such macros as of now
LWLock->exclusive is only used in the code this patch introduces.

Yeah, I think so too :p That was simply in case you
thought so.

If there isn't any particular concern, 'XXX:' should be removed.

Well yes.

That's great.

===== bufpage.c:
===== bufcapt.h:

- header comment

The file name is misspelled as 'bufcaptr.h'.

Nicely spotted.

Thank you ;)

- The includes in this header except for buf.h seem not to be
necessary.

Yep.

===== init_env.sh:

- pg_get_test_port()
It determines the server port using PG_VERSION_NUM, but is that
necessary? In addition, the theoretical maximum initial
PGPORT seems to be 65535, but this script searches for a free port
up to 16 above the starting number, which could go
beyond the valid range.

Hm, no. As of now, there is still some margin:
PG_VERSION_NUM = 90500
PG_VERSION_NUM % 16384 + 49152 = 57732

Yes, in practice it is no problem. I was talking about the theoretical
maximum value, without any assumption about the value of
PG_VERSION_NUM. In reality there is no problem until
PostgreSQL 9.82.88 arrives, and 10.0.0 does not cause a problem either.
So I won't be disappointed if it is not fixed.
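
The port arithmetic discussed above can be checked with a small C sketch. The base 49152, the modulus 16384 and the 16 attempts come from the script under review; the function and macro names are hypothetical and only serve the illustration.

```c
#include <stdbool.h>

/* Values taken from the script discussed in the thread. */
#define PORT_BASE		49152
#define PORT_MODULUS	16384
#define PORT_ATTEMPTS	16		/* the loop gives up after 16 busy ports */
#define PORT_MAX		65535	/* largest valid TCP port number */

/* First port the script tries for a given PG_VERSION_NUM. */
static int
first_test_port(int pg_version_num)
{
	return pg_version_num % PORT_MODULUS + PORT_BASE;
}

/* Last port the script can try: the first one plus 15 increments. */
static int
last_test_port(int pg_version_num)
{
	return first_test_port(pg_version_num) + PORT_ATTEMPTS - 1;
}

/* True when the whole search window stays within valid port numbers. */
static bool
port_window_is_valid(int pg_version_num)
{
	return last_test_port(pg_version_num) <= PORT_MAX;
}
```

With these definitions, 90500 maps to 57732 as computed above, and the search window only runs past 65535 once the remainder of PG_VERSION_NUM modulo 16384 exceeds 16368.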

- pg_get_test_port()

It stops with an error after 16 failures, but the message
complains only about the final attempt. If you want to mention
the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT
..' or 'All 16 ports attempted failed' or something like that.

Hm. You could complain about pg_upgrade as well now for the same
thing. But I guess that it doesn't hurt to complain back to caller
about the range of ports already in use, so changed this way.

Yes, this comment also comes from a kind of
fastidiousness. I'm fine with leaving it unfixed if you think so.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#25)

Re: WAL replay bugs

Hello,

At Thu, 3 Jul 2014 14:48:50 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ2y3QkapAsaC6oXXQTWbVkkxCrfTuA0w+DX-j3i-LByQ@mail.gmail.com>

TODO

...

The type of each page can be confirmed in the following way, *if*
its type has been *hinted* beforehand, except for gin.

btree : 32bit magic at pd->opaque
gin : No magic
gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_id
spgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id
hash : 16-bit magic at ((HashPageOpaque*)pd->opaque)->hasho_page_id
sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page.

# Is it comprehensive? and correct?

Sequence pages use the last 4 bytes. Have a look at sequence_magic in
sequence.c.
For btree pages we can use the last 2 bytes and a check on MAX_BT_CYCLE_ID.
For gin, I'll investigate if it is possible to add an identifier like
GIN_PAGE_ID, it would make the page analysis more consistent with the
others. I am not sure what the 8 bytes allocated for the special
area are currently used for, though.
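
A minimal sketch of the tail-based check described above, assuming the page is already known to be an index page. The identifier values are assumed to match HASHO_PAGE_ID, GIST_PAGE_ID, SPGIST_PAGE_ID and MAX_BT_CYCLE_ID in the PostgreSQL headers and should be verified against the source tree; the block size of 8192 is the assumed default, and all names prefixed with SKETCH_ or PAGE_GUESS_ are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ			8192	/* assumed default block size */
#define SKETCH_HASH_PAGE_ID		0xFF80	/* assumed value of HASHO_PAGE_ID */
#define SKETCH_GIST_PAGE_ID		0xFF81	/* assumed value of GIST_PAGE_ID */
#define SKETCH_SPGIST_PAGE_ID	0xFF82	/* assumed value of SPGIST_PAGE_ID */
#define SKETCH_MAX_BT_CYCLE_ID	0xFF7F	/* assumed value of MAX_BT_CYCLE_ID */

typedef enum
{
	PAGE_GUESS_UNKNOWN,
	PAGE_GUESS_MAYBE_BTREE,
	PAGE_GUESS_HASH,
	PAGE_GUESS_GIST,
	PAGE_GUESS_SPGIST
} page_guess;

/* Read the last two bytes of the page, in host byte order. */
static uint16_t
page_tail_id(const char *page)
{
	uint16_t	id;

	memcpy(&id, page + SKETCH_BLCKSZ - sizeof(id), sizeof(id));
	return id;
}

/* Classify a page from its tail identifier alone. */
static page_guess
guess_page_type(const char *page)
{
	uint16_t	id = page_tail_id(page);

	if (id == SKETCH_HASH_PAGE_ID)
		return PAGE_GUESS_HASH;
	if (id == SKETCH_GIST_PAGE_ID)
		return PAGE_GUESS_GIST;
	if (id == SKETCH_SPGIST_PAGE_ID)
		return PAGE_GUESS_SPGIST;
	/* A btree cycle ID never exceeds MAX_BT_CYCLE_ID; this is only a hint. */
	if (id <= SKETCH_MAX_BT_CYCLE_ID)
		return PAGE_GUESS_MAYBE_BTREE;
	return PAGE_GUESS_UNKNOWN;
}
```

Because gin has no identifier, and the last bytes of a page without a special area hold ordinary data, a result from this sketch is only meaningful when the page type has been hinted beforehand.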

The majority use a 16-bit magic at the TAIL of the opaque struct. If
we can unify them into, say, a 16-bit magic at
*(uint16*)(pd->opaque), the sizeof() labyrinth becomes stable and
simple, and other tools that need to identify the type of pages
will be happy. Do you think we can do that?

Yes I think so. I'll raise a different thread though as this is a
different problem than what this patch is targeting. I would even
imagine a macro in bufpage.c able to handle that well.

OK, that being the case, this topic should be set aside here and I'll
look into it separately. Thank you.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#32

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#30)

2 attachment(s)

Re: WAL replay bugs

Hello, the new patch looks good to me.

Its usage is a bit irregular for an (extension) module, but
that is the nature of this 'contrib'. The rearranged page type
detection logic seems neater and continues to work as intended. This
logic will be changed once the new page type detection scheme
is made ready by another patch.

I have some additional comments, which should be the last
ones. All of the comments are about test.sh.

- A question mark seems missing from the end of the message "Has
build been done with -DBUFFER_CAPTURE included in CFLAGS" in
test.sh.

- "make check" runs "regress --use-existing" but IMHO the make
target which is expected to do that is installcheck. I was
fooled into believing that it would run the postgres that is
built specifically for it :(

- postgres processes are left running after
test_default(custom).sh has stopped halfway. This can be fixed
with the attached patch, but, to be honest, this seems too
much. I'll follow your decision whether or not to do this.
(bufcapt_test_sh_20140710.patch)

- test_default.sh is not only an example script showing how to
use this facility, but also the test script for the
facility itself.

So I think it would be better to add some queries so that all
possible page types are covered by the default testing. What do
you think about the attached patch? (hash index is unlogged
but I dared to put it in for clarity.)

(bufcapt_test_default_sh_20140710.patch)

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

bufcapt_test_sh_20140710.patch (text/x-patch; charset=us-ascii)

diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
index 89740bb..ba5e290 100644
--- a/contrib/buffer_capture_cmp/test.sh
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -117,16 +117,27 @@ pg_ctl -w -D $TEST_STANDBY start
 # Check the presence of custom tests and kick them in priority. If not,
 # fallback to the default tests. Tests need only to be run on the master
 # node.
+
 if [ -f ./test-custom.sh ]; then
-	. ./test-custom.sh
+	TEST_SCRIPT=test-custom.sh
 else
-	. ./test-default.sh
+	TEST_SCRIPT=test-default.sh
 fi
 
+set +e
+bash -e $TEST_SCRIPT
+EXITSTATUS=$?
+set -e
+
 # Time to stop the nodes as tests have run
 pg_ctl -w -D $TEST_MASTER stop
 pg_ctl -w -D $TEST_STANDBY stop
 
+if [ $EXITSTATUS != 0 ]; then
+	echo "$TEST_SCRIPT exited by error"
+	exit 1;
+fi
+
 DIFF_FILE=capture_differences.txt
 
 # Check if the capture files exist. If not, build may have not been

bufcapt_test_default_sh_20140710.patch (text/x-patch; charset=us-ascii)

diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
index 5bec503..24091ff 100644
--- a/contrib/buffer_capture_cmp/test-default.sh
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -11,4 +11,16 @@
 # cd $ROOT_DIR
 
 # Create a simple table
-psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a'
+psql -c 'CREATE TABLE tbtree AS SELECT generate_series(1, 10) AS a'
+psql -c 'CREATE INDEX i_tbtree ON tbtree USING btree(a)'
+psql -c 'CREATE TABLE thash AS SELECT generate_series(1, 10) AS a'
+psql -c 'CREATE INDEX i_thash ON thash USING hash(a)'
+psql -c 'CREATE TABLE tgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a'
+psql -c 'CREATE INDEX i_tgist ON tgist USING gist(p1)'
+psql -c 'CREATE TABLE tspgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a'
+psql -c 'CREATE INDEX i_tspgist ON tspgist USING spgist(p1)'
+psql -c 'CREATE TABLE tgin AS SELECT ARRAY[a/10, a%10] as a1 from generate_series(0, 10) a'
+psql -c 'CREATE INDEX i_tgin ON tgin USING gin(a1)'
+psql -c 'CREATE SEQUENCE sq1'
+
+

#33

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#32)

3 attachment(s)

Re: WAL replay bugs

Updated patches attached.

On Thu, Jul 10, 2014 at 7:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Its usage is a bit irregular for an (extension) module, but
that is the nature of this 'contrib'. The rearranged page type
detection logic seems neater and continues to work as intended. This
logic will be changed once the new page type detection scheme
is made ready by another patch.

No disk format changes will be allowed just to make page detection
easier (Tom mentioned that earlier in this thread btw). We'll have to
live with what current code offers, especially considering that adding
new bytes for page detection for gin pages would double the size of
its special area after applying MAXALIGN to it.
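
The doubling mentioned above follows from alignment rounding. As a minimal sketch, assuming an 8-byte maximum alignment (typical of 64-bit platforms) and taking the 8-byte size of the gin special area from earlier in this thread, the arithmetic looks like this. The names are hypothetical; the rounding mirrors what a MAXALIGN-style macro does.

```c
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, typical of 64-bit platforms. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of the maximum alignment. */
static size_t
sketch_maxalign(size_t len)
{
	return (len + (SKETCH_MAXIMUM_ALIGNOF - 1)) &
		~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1));
}

/* Size of a special area once extra identifier bytes are added to it. */
static size_t
special_size_with_extra(size_t current_size, size_t extra_bytes)
{
	return sketch_maxalign(current_size + extra_bytes);
}
```

Under these assumptions, an 8-byte special area that gains a 2-byte identifier needs 10 bytes, which rounds up to 16.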

- A question mark seems missing from the end of the message "Has
build been done with -DBUFFER_CAPTURE included in CFLAGS" in
test.sh.

Fixed.

- "make check" runs "regress --use-existing" but IMHO the make
target which is expected to do that is installcheck. I was
fooled into believing that it would run the postgres that is
built specifically for it :(

Yes, the patch is abusing that. --use-existing is specified in this
case because the test itself is controlling Postgres servers to build
and fetch the buffer captures. This allows more flexible machinery
IMO.

- postgres processes are left running after
test_default(custom).sh has stopped halfway. This can be fixed
with the attached patch, but, to be honest, this seems too
much. I'll follow your decision whether or not to do this.
(bufcapt_test_sh_20140710.patch)

I had considered that first, thinking that it was the user's
responsibility if things are screwed up with his custom scripts. I
guess that the way you are doing it is a safeguard simple enough
though, so included with some editing, particularly reporting to the
user the error code returned by the test script.

- test_default.sh is not only an example script showing how to
use this facility, but also the test script for the
facility itself.
So I think it would be better to add some queries so that all
possible page types are covered by the default testing. What do
you think about the attached patch? (hash index is unlogged
but I dared to put it in for clarity.)

Yeah, having a default set of queries run just to let the user get an
idea of how it works improves things.
Once again thanks for taking the time to look at that.

Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-and-sequence-page-opaque-data-to-sequ.patch (text/x-patch; charset=US-ASCII)

From ce66e24081ead2cb42b02f007039287947d0cca6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 13:34:47 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC and sequence page opaque data to
 sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 34 ++++++++++++----------------------
 src/include/commands/sequence.h | 13 +++++++++++++
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..802aac7 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -46,16 +46,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -306,7 +296,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -316,9 +306,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1066,18 +1056,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1541,7 +1531,7 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	/* Backup blocks are not used in seq records */
 	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
@@ -1564,9 +1554,9 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = record->xl_len - sizeof(xl_seq_rec);
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..2455878 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is for the convenience of being able to identify if a page
+ * is being used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.1

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/x-patch; charset=US-ASCII)

From 994c58197e876458c2f4cf4cbf95240a938910ec Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 14:12:11 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 +++--------------------------------
 src/test/shell/init_env.sh | 61 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..2e1c61a 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$newsrc"/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port "$newsrc"
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..e4d6cdb
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+# src/test/shell/init.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses desirably
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	PGPORT_START=$PGPORT
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo "Ports from $PGPORT_START to $PGPORT are apparently in use"
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.1

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/x-patch; charset=US-ASCII)

From f71458981ad672ea7950494fc01bc9dd19df33d4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 11 Jul 2014 09:46:21 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines that
can be used to check for consistency at page level when replaying WAL
files among several nodes of a cluster (generally master and standby
node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across node consistent (flags like hint bits for example) and then each
buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a CFLAGS
called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder... There is a default set of tests saved in test-default.sh
but user is free to set up custom tests by creating a file called
test-custom.sh that can be kicked by the test facility if this file
is present instead of the defaults.
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  35 ++
 contrib/buffer_capture_cmp/README               |  33 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 327 +++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  26 ++
 contrib/buffer_capture_cmp/test.sh              | 184 +++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 504 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   1 -
 src/include/commands/sequence.h                 |   2 +-
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  66 ++++
 16 files changed, 1258 insertions(+), 4 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..f4d2d8d
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,35 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS = --use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+else
+check:
+	echo "BUFFER_CAPTURE is not defined in CFLAGS, so no tests to run"
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..8a0a154
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,33 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that can be run
+by default with the following command:
+
+    make check
+
+The regression scripts contain a hook that can be used as an entry point
+to run some custom tests using this facility. Simply create in this folder
+a file called test-custom.sh and execute all the commands necessary for the
+tests. This custom script should use PGPORT as port to connect to the
+server where tests are run.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..edf9bed
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,327 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know that the comparison result
+			 * we the last buffer in list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Fetch LSN from the first 8 bytes of the buffer */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of records */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: postprocess-images <master's capture file> <standby's capture file>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		exit(1);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		exit(1);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (nlines_b == 0 || (nlines_a > 0 && lsn_a < lsn_b))
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..914c8a6
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck > /dev/null 2>&1
+# cd $ROOT_DIR
+
+# Run a simple set of queries
+psql --no-psqlrc <<EOF
+CREATE TABLE tbtree AS SELECT generate_series(1, 10) AS a;
+CREATE INDEX i_tbtree ON tbtree USING btree(a);
+CREATE TABLE thash AS SELECT generate_series(1, 10) AS a;
+CREATE INDEX i_thash ON thash USING hash(a);
+CREATE TABLE tgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a;
+CREATE INDEX i_tgist ON tgist USING gist(p1);
+CREATE TABLE tspgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a;
+CREATE INDEX i_tspgist ON tspgist USING spgist(p1);
+CREATE TABLE tgin AS SELECT ARRAY[a/10, a%10] as a1 from generate_series(0, 10) a;
+CREATE INDEX i_tgin ON tgin USING gin(a1);
+CREATE SEQUENCE sq1;
+EOF
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..6563811
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,184 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_capture on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Fetch file name containing buffer captures
+CAPTURE_FILE_NAME=`grep '#define BUFFER_CAPTURE_FILE' "$PG_ROOT_DIR"/src/include/storage/bufcapt.h | awk '{print $3}' | sed 's/\"//g'`
+CAPTURE_FILE_MASTER=$TEST_MASTER/$CAPTURE_FILE_NAME
+CAPTURE_FILE_STANDBY=$TEST_STANDBY/$CAPTURE_FILE_NAME
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check the presence of custom tests and kick them in priority. If not,
+# fallback to the default tests. Tests need only to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	TEST_SCRIPT=test-custom.sh
+else
+	TEST_SCRIPT=test-default.sh
+fi
+
+# In case of an error from the test script, wait until the clusters have
+# been stopped before reporting anything. We do not want nodes still
+# running after this test, particularly if the user has done some stupid
+# things with a custom script.
+set +e
+${SHELL} -e $TEST_SCRIPT
+EXITSTATUS=$?
+set -e
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+if [ $EXITSTATUS != 0 ]; then
+	echo "$TEST_SCRIPT exited with code $EXITSTATUS"
+	exit $EXITSTATUS;
+fi
+
+DIFF_FILE=capture_differences.txt
+
+# Check if the capture files exist. If not, build may have not been
+# done with BUFFER_CAPTURE enabled.
+if [ ! -f $CAPTURE_FILE_MASTER ]; then
+	echo "Capture file $CAPTURE_FILE_MASTER is missing on master"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS?"
+	exit 0
+fi
+if [ ! -f $CAPTURE_FILE_STANDBY ]; then
+	echo "Capture file $CAPTURE_FILE_STANDBY is missing on standby"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS?"
+	exit 0
+fi
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Leave on error
+if [ $ERR_NUM == 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6861ae0..d43406c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7010,6 +7013,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..57c8c6e
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask |=
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * Those functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+lsn_is_updated(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Flush to file the new content page present here after applying a
+ * mask on it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		/* Case of a normal relation, it has an empty special area */
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+			 tail == GIST_PAGE_ID)
+	{
+		/* Gist page */
+		mask_gist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+			 tail <= MAX_BT_CYCLE_ID)
+	{
+		/* btree page */
+		mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+			 tail == SPGIST_PAGE_ID)
+	{
+		/* SpGist page */
+		mask_spgist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+	{
+		/*
+		 * The page found here is used either for a Gin index or a sequence.
+		 * Gin index pages do not have a proper identifier, so check if the page
+		 * is used by a sequence or not. If it is not the case, this page is used
+		 * by a gin index. It is still possible that a gin page holds exactly the
+		 * same value as SEQ_MAGIC in this area, but this is unlikely to happen.
+		 */
+		if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+	else
+	{
+		/* Should not come here */
+		Assert(0);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Make sure the entry reaches the capture file */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int			i;
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	Page		content;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (lsn_is_updated(img))
+			{
+				content = BufferGetPage(img->buffer);
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+
+	/* Gather some information about this buffer image not found */
+	content = BufferGetPage(buffer);
+	BufferGetTag(buffer, &rnode, &forknum, &blkno);
+	elog(LOG, "could not find image for buffer %u: LSN %X/%08X rnode %u, "
+			  "forknum %u, blkno %u",
+		 buffer, ((PageHeader) content)->pd_lsn.xlogid,
+		 ((PageHeader) content)->pd_lsn.xrecoff,
+		 rnode.relNode, forknum, blkno);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed and
+ * update them if it is the case.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (lsn_is_updated(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..4624a99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1708,6 +1711,10 @@ AtEOXact_Buffers(bool isCommit)
 {
 	CheckForBufferLeaks();
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1724,6 +1731,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2749,6 +2760,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * Peek into the LWLock struct to see if we're holding it in exclusive
+		 * or shared mode. This is concurrency-safe: if we're holding it in
+		 * exclusive mode, no-one else can release it. If we're holding it
+		 * in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2757,6 +2782,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2768,6 +2798,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	acquired;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2775,7 +2806,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	acquired = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (acquired)
+		buffer_capture_remember(buffer);
+#endif
+
+	return acquired;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a62af27..4671a55 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1272,6 +1276,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..8b3b83c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,7 +21,6 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
-
 /* GUC variable */
 bool		ignore_checksum_failure = false;
 
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 2455878..cbd6780 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -27,7 +27,7 @@ typedef struct SequencePageOpaqueData
 } SequencePageOpaqueData;
 
 /*
- * This page ID is for the conveniende to be able to identify if a page
+ * This page ID is for the convenience to be able to identify if a page
  * is being used by a sequence.
  */
 #define SEQ_MAGIC		0x1717
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..012e5eb
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also flushed to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are aimed to be used with the utility tool
+ * called buffer_capture_cmp available in contrib/ and can be used to
+ * compare how WAL is replayed between master and standby nodes, helping
+ * in spotting bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * done among nodes. Be aware that this has a large impact on I/O and that
+ * it is aimed only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN corresponds to the log sequence number to which the page image
+ * applies, then the content of the image is added as-is. This format
+ * is chosen to facilitate comparisons between each capture entry
+ * particularly in cases where LSN increases its digit number. As unlogged
+ * relations do not have LSN numbers saved, their buffer modifications are
+ * not captured by this facility.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.1
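
For anyone post-processing the capture files by hand: each entry is one line in the format documented in bufcapt.h, "LSN: %08X/%08X page: " followed by the hex dump of the masked page. Below is a minimal sketch of parsing that prefix. parse_capture_line is a made-up helper for illustration only; it is not part of the patch, and buffer_capture_cmp may well read the file differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse the "LSN: XXXXXXXX/XXXXXXXX page: " prefix of
 * one capture line. On success, stores the 64-bit LSN in *lsn and returns a
 * pointer to the first hex digit of the page image. Returns NULL if the
 * line does not match the expected layout.
 */
static const char *
parse_capture_line(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	/* %n is only reached (and consumed set) if " page:" matched as well */
	if (sscanf(line, "LSN: %8X/%8X page: %n", &hi, &lo, &consumed) != 2 ||
		consumed == 0)
		return NULL;

	/* Same composition as the (newlsn >> 32, newlsn) split on output */
	*lsn = ((uint64_t) hi << 32) | lo;
	return line + consumed;
}
```

Combining both halves into one integer is what makes the records comparable with plain < and >, which the merge loop in main() relies on.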

#34

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#33)

Re: WAL replay bugs

Hello, Let me apologize for continuing the discussion even though
the deadline is approaching.

At Fri, 11 Jul 2014 09:49:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTJEzOz-FotibSEjyG0eaBngx2PLqywoDChYFXzFqYQkg@mail.gmail.com>

Updated patches attached.

On Thu, Jul 10, 2014 at 7:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The usage of this is a bit irregular for an (extension) module, but
that is the nature of this 'contrib'. The rearranged page type
detection logic seems neater and keeps working as intended. This
logic will change once the new page type detection scheme from
the other patch becomes ready.

No disk format changes will be allowed just to make page detection
easier (Tom mentioned that earlier in this thread btw). We'll have to
live with what current code offers, especially considering that adding
new bytes for page detection for gin pages would double the size of
its special area after applying MAXALIGN to it.
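
To put numbers on the MAXALIGN remark, here is a minimal sketch, assuming 8-byte maximum alignment (typical on 64-bit platforms). The stand-in structs mirror what I understand the GinPageOpaqueData layout to be: a 4-byte right-link plus two 2-byte fields. All SKETCH_/Sketch names and the page_id field are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-byte maximum alignment */
#define SKETCH_MAXIMUM_ALIGNOF 8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/* Stand-in for the current gin opaque data: 4 + 2 + 2 = 8 bytes */
typedef struct
{
	uint32_t	rightlink;
	uint16_t	maxoff;
	uint16_t	flags;
} SketchGinOpaque;

/* Hypothetical variant carrying an extra 2-byte page identifier */
typedef struct
{
	uint32_t	rightlink;
	uint16_t	maxoff;
	uint16_t	flags;
	uint16_t	page_id;
} SketchGinOpaqueWithId;

/* Special area size as it would be reserved on the page today */
static size_t
sketch_special_size_current(void)
{
	return SKETCH_MAXALIGN(sizeof(SketchGinOpaque));
}

/* Special area size once the identifier is added and rounded up */
static size_t
sketch_special_size_with_id(void)
{
	return SKETCH_MAXALIGN(sizeof(SketchGinOpaqueWithId));
}
```

Under these assumptions the special area goes from 8 to 16 bytes on every gin page, which is the doubling mentioned above.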

That's awkward, but I agree with it. By the way, I found that
PageHeaderData.pd_flags has 9 bits of room. They seem usable if no
other use is in sight right now, and if a formal method to identify
page types is worth 3-4 of those bits.
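
As a rough illustration of the pd_flags idea only (it would still be an on-disk format change, so this is not something proposed for the patch): a sketch assuming a hypothetical 4-bit type tag kept in the top bits of the 16-bit flags word. The three PD_ values are the ones bufpage.h defines; everything prefixed SKETCH_ or sketch_ is made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Existing flag bits, as defined in bufpage.h */
#define PD_HAS_FREE_LINES	0x0001
#define PD_PAGE_FULL		0x0002
#define PD_ALL_VISIBLE		0x0004

/* Hypothetical: 4-bit page type tag in the top bits of pd_flags */
#define SKETCH_PD_TYPE_SHIFT	12
#define SKETCH_PD_TYPE_MASK		0xF000

/* Hypothetical type codes, chosen arbitrarily for the sketch */
enum sketch_page_type
{
	SKETCH_PAGE_HEAP = 0,
	SKETCH_PAGE_BTREE = 1,
	SKETCH_PAGE_GIN = 2,
	SKETCH_PAGE_SEQUENCE = 3
};

/* Return pd_flags with the type tag replaced, other bits untouched */
static uint16_t
sketch_set_page_type(uint16_t pd_flags, unsigned int type)
{
	return (uint16_t) ((pd_flags & ~SKETCH_PD_TYPE_MASK) |
					   ((type << SKETCH_PD_TYPE_SHIFT) & SKETCH_PD_TYPE_MASK));
}

/* Extract the type tag from pd_flags */
static unsigned int
sketch_get_page_type(uint16_t pd_flags)
{
	return (pd_flags & SKETCH_PD_TYPE_MASK) >> SKETCH_PD_TYPE_SHIFT;
}
```

With something like this the capture code could switch on the tag instead of guessing from the special area size, but pages written by existing clusters would not carry the tag.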

# This is a separate discussion from this patch itself.

- "make check" runs "regress --use-existing" but IMHO the make
target which is expected to do that is installcheck. I was
fooled into believing that it would run a postgres built
specifically for the test :(

Yes, the patch is abusing that. --use-existing is specified in this
case because the test itself is controlling Postgres servers to build
and fetch the buffer captures. This allows more flexible machinery
IMO.

Although I doubt the flexibility is necessary given the current
testing framework, I don't have a strong objection to
that. Nevertheless, I would appreciate it if you put a notice
in the README or somewhere.

- postgres processes are left running after
test_default(custom).sh has stopped halfway. This can be fixed
with the attached patch, but, to be honest, this seems too
much. I'll follow your decision whether or not to do this.
(bufcapt_test_sh_20140710.patch)

I had considered that first, thinking that it was the user's
responsibility if things are screwed up with his custom scripts. I
guess that the way you are doing it is a safeguard simple enough
though, so included with some editing, particularly reporting to the
user the error code returned by the test script.

Agreed.

- test_default.sh is not only an example script that runs using
this facility, but also the test script for the facility itself.
So I think it would be better to add some queries so that all
possible page types are covered by the default testing. What do
you think about the attached patch? (Hash indexes are not
WAL-logged, but I dared to include one for clarity.)

Yeah, having a default set of queries run just to let the user get an
idea of how it works improves things.
Once again thanks for taking the time to look at that.

Thank you.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#35

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#34)

3 attachment(s)

Re: WAL replay bugs

On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Although I doubt the necessity of that flexibility given the current
testing framework, I don't have a strong objection to it.
Nevertheless, it would be appreciated if you put a notice in the
README or somewhere.

Hm, well... Fine, I added it in this updated series.

Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-and-sequence-page-opaque-data-to-sequ.patch (text/plain; charset=US-ASCII)

From ce66e24081ead2cb42b02f007039287947d0cca6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 13:34:47 +0900
Subject: [PATCH 1/3] Move SEQ_MAGIC and sequence page opaque data to
 sequence.h

This can allow a backend process to detect if a page is being used
for a sequence.
---
 src/backend/commands/sequence.c | 34 ++++++++++++----------------------
 src/include/commands/sequence.h | 13 +++++++++++++
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e608420..802aac7 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -46,16 +46,6 @@
 #define SEQ_LOG_VALS	32
 
 /*
- * The "special area" of a sequence's buffer page looks like this.
- */
-#define SEQ_MAGIC	  0x1717
-
-typedef struct sequence_magic
-{
-	uint32		magic;
-} sequence_magic;
-
-/*
  * We store a SeqTable item for every sequence we have touched in the current
  * session.  This is needed to hold onto nextval/currval state.  (We can't
  * rely on the relcache, since it's only, well, a cache, and may decide to
@@ -306,7 +296,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 {
 	Buffer		buf;
 	Page		page;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	OffsetNumber offnum;
 
 	/* Initialize first page of relation with special magic number */
@@ -316,9 +306,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	page = BufferGetPage(buf);
 
-	PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
-	sm->magic = SEQ_MAGIC;
+	PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	/* Now insert sequence tuple */
 
@@ -1066,18 +1056,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
 {
 	Page		page;
 	ItemId		lp;
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 	Form_pg_sequence seq;
 
 	*buf = ReadBuffer(rel, 0);
 	LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(*buf);
-	sm = (sequence_magic *) PageGetSpecialPointer(page);
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page);
 
-	if (sm->magic != SEQ_MAGIC)
+	if (sm->seq_page_id != SEQ_MAGIC)
 		elog(ERROR, "bad magic number in sequence \"%s\": %08X",
-			 RelationGetRelationName(rel), sm->magic);
+			 RelationGetRelationName(rel), sm->seq_page_id);
 
 	lp = PageGetItemId(page, FirstOffsetNumber);
 	Assert(ItemIdIsNormal(lp));
@@ -1541,7 +1531,7 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	char	   *item;
 	Size		itemsz;
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
-	sequence_magic *sm;
+	SequencePageOpaqueData *sm;
 
 	/* Backup blocks are not used in seq records */
 	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
@@ -1564,9 +1554,9 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
 	 */
 	localpage = (Page) palloc(BufferGetPageSize(buffer));
 
-	PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
-	sm = (sequence_magic *) PageGetSpecialPointer(localpage);
-	sm->magic = SEQ_MAGIC;
+	PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData));
+	sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage);
+	sm->seq_page_id = SEQ_MAGIC;
 
 	item = (char *) xlrec + sizeof(xl_seq_rec);
 	itemsz = record->xl_len - sizeof(xl_seq_rec);
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 8819c00..2455878 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -18,6 +18,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+	uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is a convenience to identify whether a page is being
+ * used by a sequence.
+ */
+#define SEQ_MAGIC		0x1717
 
 typedef struct FormData_pg_sequence
 {
-- 
2.0.1

0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/plain; charset=US-ASCII)

From 994c58197e876458c2f4cf4cbf95240a938910ec Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 4 Jul 2014 14:12:11 +0900
Subject: [PATCH 2/3] Extract generic bash initialization process from
 pg_upgrade

Such initialization is useful as well for some other utilities and makes
test settings consistent.
---
 contrib/pg_upgrade/test.sh | 47 +++--------------------------------
 src/test/shell/init_env.sh | 61 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 43 deletions(-)
 create mode 100644 src/test/shell/init_env.sh

diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..2e1c61a 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -9,24 +9,14 @@
 # Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 
-set -e
-
-: ${MAKE=make}
-
-# Guard against parallel make issues (see comments in pg_regress.c)
-unset MAKEFLAGS
-unset MAKELEVEL
-
-# Establish how the server will listen for connections
-testhost=`uname -s`
+# Initialize test
+. ../../src/test/shell/init_env.sh
 
 case $testhost in
 	MINGW*)
-		LISTEN_ADDRESSES="localhost"
 		PGHOST=""; unset PGHOST
 		;;
 	*)
-		LISTEN_ADDRESSES=""
 		# Select a socket directory.  The algorithm is from the "configure"
 		# script; the outcome mimics pg_regress.c:make_temp_sockdir().
 		PGHOST=$PG_REGRESS_SOCK_DIR
@@ -102,37 +92,8 @@ logdir=$PWD/log
 rm -rf "$logdir"
 mkdir "$logdir"
 
-# Clear out any environment vars that might cause libpq to connect to
-# the wrong postmaster (cf pg_regress.c)
-#
-# Some shells, such as NetBSD's, return non-zero from unset if the variable
-# is already unset. Since we are operating under 'set -e', this causes the
-# script to fail. To guard against this, set them all to an empty string first.
-PGDATABASE="";        unset PGDATABASE
-PGUSER="";            unset PGUSER
-PGSERVICE="";         unset PGSERVICE
-PGSSLMODE="";         unset PGSSLMODE
-PGREQUIRESSL="";      unset PGREQUIRESSL
-PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
-PGHOSTADDR="";        unset PGHOSTADDR
-
-# Select a non-conflicting port number, similarly to pg_regress.c
-PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$newsrc"/src/include/pg_config.h | awk '{print $3}'`
-PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
-export PGPORT
-
-i=0
-while psql -X postgres </dev/null 2>/dev/null
-do
-	i=`expr $i + 1`
-	if [ $i -eq 16 ]
-	then
-		echo port $PGPORT apparently in use
-		exit 1
-	fi
-	PGPORT=`expr $PGPORT + 1`
-	export PGPORT
-done
+# Get a port to run the tests
+pg_get_test_port "$newsrc"
 
 # buildfarm may try to override port via EXTRA_REGRESS_OPTS ...
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --port=$PGPORT"
diff --git a/src/test/shell/init_env.sh b/src/test/shell/init_env.sh
new file mode 100644
index 0000000..e4d6cdb
--- /dev/null
+++ b/src/test/shell/init_env.sh
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+# src/test/shell/init_env.sh
+#
+# Utility initializing environment for tests to be conducted in shell.
+# The initialization done here is made to be platform-proof.
+
+set -e
+
+: ${MAKE=make}
+
+# Guard against parallel make issues (see comments in pg_regress.c)
+unset MAKEFLAGS
+unset MAKELEVEL
+
+# Set listen_addresses as appropriate for the platform
+testhost=`uname -s`
+case $testhost in
+	MINGW*)	LISTEN_ADDRESSES="localhost" ;;
+	*)		LISTEN_ADDRESSES="" ;;
+esac
+
+# Clear out any environment vars that might cause libpq to connect to
+# the wrong postmaster (cf pg_regress.c)
+#
+# Some shells, such as NetBSD's, return nonzero from unset if the variable
+# is already unset. Since we are operating under 'set -e', this causes the
+# script to fail. To guard against this, set them all to an empty string first.
+PGDATABASE="";        unset PGDATABASE
+PGUSER="";            unset PGUSER
+PGSERVICE="";         unset PGSERVICE
+PGSSLMODE="";         unset PGSSLMODE
+PGREQUIRESSL="";      unset PGREQUIRESSL
+PGCONNECT_TIMEOUT=""; unset PGCONNECT_TIMEOUT
+PGHOST="";            unset PGHOST
+PGHOSTADDR="";        unset PGHOSTADDR
+
+# Select a non-conflicting port number, similarly to pg_regress.c, and
+# save its value in PGPORT. Caller should either save or use this value
+# for the tests.
+pg_get_test_port()
+{
+	PG_ROOT_DIR=$1
+	PG_VERSION_NUM=`grep '#define PG_VERSION_NUM' "$PG_ROOT_DIR"/src/include/pg_config.h | awk '{print $3}'`
+	PGPORT=`expr $PG_VERSION_NUM % 16384 + 49152`
+	PGPORT_START=$PGPORT
+	export PGPORT
+
+	i=0
+	while psql -X postgres </dev/null 2>/dev/null
+	do
+		i=`expr $i + 1`
+		if [ $i -eq 16 ]
+		then
+			echo "Ports from $PGPORT_START to $PGPORT are apparently in use"
+			exit 1
+		fi
+		PGPORT=`expr $PGPORT + 1`
+		export PGPORT
+	done
+}
-- 
2.0.1

0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain; charset=US-ASCII)

From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 19 Jul 2014 10:40:20 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

This is a tool intended for developers and buildfarm machines to check
for consistency at page level when replaying WAL files among several
nodes of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits, for example), and then
each buffer is captured with the following format as a single line of
the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between pages,
and format is chosen to facilitate comparison between buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be used
to compare buffer captures between nodes.

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be defined in CFLAGS at build time
to enable the buffer capture at server level. If this symbol is not
defined, both server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11G when running
regression tests, counting double when using both master and standby).

contrib/buffer_capture_cmp contains a regression test facility easing
testing with buffer captures. The user just needs to run "make check"
in this folder. There is a default set of tests saved in
test-default.sh, but the user is free to set up custom tests by creating
a file called test-custom.sh, which the test facility runs instead of
the defaults if it is present.
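
As an illustration of the line format described above, the following
sketch parses the LSN prefix of one capture line into a 64-bit value.
The function name sketch_parse_lsn is hypothetical and is not part of
this patch; it only assumes that a line starts with the
"LSN: %08X/%08X" prefix.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the "LSN: XXXXXXXX/XXXXXXXX" prefix of a capture line.
 * Returns 1 and stores the combined 64-bit LSN on success, 0 otherwise.
 */
static int
sketch_parse_lsn(const char *line, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	/* Both halves are printed as hexadecimal numbers. */
	if (sscanf(line, "LSN: %8X/%8X", &hi, &lo) != 2)
		return 0;

	/* The high half holds the upper 32 bits of the LSN. */
	*lsn = ((uint64_t) hi << 32) | (uint64_t) lo;
	return 1;
}
```

Comparing the resulting integers orders lines by LSN independently of
the page image that follows on the same line.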
---
 contrib/Makefile                                |   1 +
 contrib/buffer_capture_cmp/.gitignore           |   9 +
 contrib/buffer_capture_cmp/Makefile             |  35 ++
 contrib/buffer_capture_cmp/README               |  35 ++
 contrib/buffer_capture_cmp/buffer_capture_cmp.c | 327 +++++++++++++++
 contrib/buffer_capture_cmp/test-default.sh      |  26 ++
 contrib/buffer_capture_cmp/test.sh              | 184 +++++++++
 src/backend/access/heap/heapam.c                |  11 +
 src/backend/storage/buffer/Makefile             |   2 +-
 src/backend/storage/buffer/bufcapt.c            | 504 ++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c             |  40 +-
 src/backend/storage/lmgr/lwlock.c               |   8 +
 src/backend/storage/page/bufpage.c              |   1 -
 src/include/commands/sequence.h                 |   2 +-
 src/include/miscadmin.h                         |  13 +
 src/include/storage/bufcapt.h                   |  66 ++++
 16 files changed, 1260 insertions(+), 4 deletions(-)
 create mode 100644 contrib/buffer_capture_cmp/.gitignore
 create mode 100644 contrib/buffer_capture_cmp/Makefile
 create mode 100644 contrib/buffer_capture_cmp/README
 create mode 100644 contrib/buffer_capture_cmp/buffer_capture_cmp.c
 create mode 100644 contrib/buffer_capture_cmp/test-default.sh
 create mode 100644 contrib/buffer_capture_cmp/test.sh
 create mode 100644 src/backend/storage/buffer/bufcapt.c
 create mode 100644 src/include/storage/bufcapt.h

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..1c8e6b9 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		btree_gin	\
 		btree_gist	\
+		buffer_capture_cmp \
 		chkpass		\
 		citext		\
 		cube		\
diff --git a/contrib/buffer_capture_cmp/.gitignore b/contrib/buffer_capture_cmp/.gitignore
new file mode 100644
index 0000000..ecd8b78
--- /dev/null
+++ b/contrib/buffer_capture_cmp/.gitignore
@@ -0,0 +1,9 @@
+# Binary generated
+/buffer_capture_cmp
+
+# Regression tests
+/capture_differences.txt
+/tmp_check
+
+# Custom test file
+/test-custom.sh
diff --git a/contrib/buffer_capture_cmp/Makefile b/contrib/buffer_capture_cmp/Makefile
new file mode 100644
index 0000000..f4d2d8d
--- /dev/null
+++ b/contrib/buffer_capture_cmp/Makefile
@@ -0,0 +1,35 @@
+# contrib/buffer_capture_cmp/Makefile
+
+PGFILEDESC = "buffer_capture_cmp - Comparator tool between buffer captures"
+PGAPPICON = win32
+
+PROGRAM = buffer_capture_cmp
+OBJS	= buffer_capture_cmp.o
+
+PG_CPPFLAGS = -I$(libpq_srcdir) -DFRONTEND
+PG_LIBS = $(libpq_pgport)
+
+EXTRA_CLEAN = tmp_check/ capture_differences.txt
+
+# test.sh creates a cluster dedicated for the test
+EXTRA_REGRESS_OPTS = --use-existing
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffer_capture_cmp
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# Tests can be done only when BUFFER_CAPTURE is defined
+ifneq (,$(filter -DBUFFER_CAPTURE,$(CFLAGS)))
+check: test.sh all
+	MAKE=$(MAKE) bindir=$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) ./test.sh
+else
+check:
+	echo "BUFFER_CAPTURE is not defined in CFLAGS, so no tests to run"
+endif
diff --git a/contrib/buffer_capture_cmp/README b/contrib/buffer_capture_cmp/README
new file mode 100644
index 0000000..6c35e85
--- /dev/null
+++ b/contrib/buffer_capture_cmp/README
@@ -0,0 +1,35 @@
+buffer_capture_cmp
+------------------
+
+This facility already contains a set of regression tests that can be run
+by default with the following command:
+
+    make check
+
+The regression scripts contain a hook that can be used as an entry point
+to run some custom tests using this facility. Simply create in this folder
+a file called test-custom.sh and execute all the commands necessary for the
+tests. This custom script should use PGPORT as port to connect to the
+server where tests are run. The regression suite itself controls the
+PostgreSQL servers necessary for the tests, hence --use-existing is used
+in EXTRA_REGRESS_OPTS.
+
+Tips
+----
+
+The page images take up a lot of disk space! The PostgreSQL regression
+suite generates about 11GB - double that when the same is generated also
+in a standby.
+
+Always stop the master first, then standby. Otherwise, when you restart
+the standby, it will start WAL replay from the previous checkpoint, and
+log some page images already. Stopping the master creates a checkpoint
+record, avoiding the problem.
+
+It is possible to use pg_xlogdump to see which WAL record a page image
+corresponds to. But beware that the LSN in the page image points to the
+*end* of the WAL record, while the LSN that pg_xlogdump prints is the
+*beginning* of the WAL record. So to find which WAL record a page image
+corresponds to, find the LSN from the page image in pg_xlogdump output,
+and back off one record. (you can't just grep for the line containing
+the LSN).
diff --git a/contrib/buffer_capture_cmp/buffer_capture_cmp.c b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
new file mode 100644
index 0000000..edf9bed
--- /dev/null
+++ b/contrib/buffer_capture_cmp/buffer_capture_cmp.c
@@ -0,0 +1,327 @@
+/*-------------------------------------------------------------------------
+ *
+ * buffer_capture_cmp.c
+ *	  Utility tool to compare buffer captures between two nodes of
+ *	  a cluster. This utility needs to be run on servers whose code
+ *	  has been built with the symbol BUFFER_CAPTURE defined.
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    contrib/buffer_capture_cmp/buffer_capture_cmp.c
+ *
+ * Capture files that can be obtained on nodes of a cluster do not
+ * necessarily have the page images logged in the same order. For
+ * example, if a single WAL-logged operation modifies multiple pages,
+ * like an index page split, the standby might release the locks
+ * in different order than the master. Another cause is concurrent
+ * operations; writing the page images is not atomic with WAL insertion,
+ * so if two backends are running concurrently, their modifications in
+ * the image log can be interleaved in different order than their WAL
+ * records.
+ *
+ * To fix that, the lines from the capture file are read into a reorder
+ * buffer, and sorted there. Sorting the whole file would be overkill,
+ * as the lines are mostly in order already. The fixed-size reorder
+ * buffer works as long as the lines are not out-of-order by more than
+ * REORDER_BUFFER_SIZE lines.
+ */
+
+#include "postgres_fe.h"
+#include "port.h"
+
+/* Size of a single entry of the capture file */
+#define LINESZ (BLCKSZ*2 + 31)
+
+/* Line reordering stuff */
+#define REORDER_BUFFER_SIZE 1000
+
+typedef struct
+{
+	char	   *lines[REORDER_BUFFER_SIZE];
+	int 		nlines;		/* Number of lines currently in buffer */
+
+	FILE	   *fp;
+	bool		eof;		/* Has EOF been reached from this source? */
+} reorder_buffer;
+
+/*
+ * Read lines from the capture file into the reorder buffer, until the
+ * buffer is full.
+ */
+static void
+fill_reorder_buffer(reorder_buffer *buf)
+{
+	if (buf->eof)
+		return;
+
+	while (buf->nlines < REORDER_BUFFER_SIZE)
+	{
+		char *linebuf = pg_malloc(LINESZ);
+
+		if (fgets(linebuf, LINESZ, buf->fp) == NULL)
+		{
+			buf->eof = true;
+			pg_free(linebuf);
+			break;
+		}
+
+		/*
+		 * Common case, entry goes to the end. This happens for an
+		 * initialization or when buffer compares to be higher than
+		 * the last buffer in queue.
+		 */
+		if (buf->nlines == 0 ||
+			strcmp(linebuf, buf->lines[buf->nlines - 1]) >= 0)
+		{
+			buf->lines[buf->nlines] = linebuf;
+		}
+		else
+		{
+			/* Find the right place in the queue */
+			int			i;
+
+			/*
+			 * Scan all the existing buffers and find where buffer needs
+			 * to be included. We already know the result of the comparison
+			 * with the last buffer in the list.
+			 */
+			for (i = buf->nlines - 1; i >= 0; i--)
+			{
+				if (strcmp(linebuf, buf->lines[i]) >= 0)
+					break;
+			}
+
+			/* Place buffer correctly in the list */
+			i++;
+			memmove(&buf->lines[i + 1], &buf->lines[i],
+					(buf->nlines - i) * sizeof(char *));
+			buf->lines[i] = linebuf;
+		}
+		buf->nlines++;
+	}
+}
+
+/*
+ * Initialize a reorder buffer.
+ */
+static reorder_buffer *
+init_reorder_buffer(FILE *fp)
+{
+	reorder_buffer *buf;
+
+	buf = pg_malloc(sizeof(reorder_buffer));
+	buf->fp = fp;
+	buf->eof = false;
+	buf->nlines = 0;
+
+	fill_reorder_buffer(buf);
+
+	return buf;
+}
+
+/*
+ * Read all the lines that belong to the next WAL record from the reorder
+ * buffer.
+ */
+static int
+readrecord(reorder_buffer *buf, char **lines, uint64 *lsn)
+{
+	int			nlines;
+	uint32		line_xlogid;
+	uint32		line_xrecoff;
+	uint64		line_lsn;
+	uint64		rec_lsn;
+
+	/* Get all the lines with the same LSN */
+	for (nlines = 0; nlines < buf->nlines; nlines++)
+	{
+		/* Fetch LSN from the "LSN: %08X/%08X" prefix of the line */
+		sscanf(buf->lines[nlines], "LSN: %08X/%08X",
+			   &line_xlogid, &line_xrecoff);
+		line_lsn = ((uint64) line_xlogid) << 32 | (uint64) line_xrecoff;
+
+		if (nlines == 0)
+			*lsn = rec_lsn = line_lsn;
+		else
+		{
+			if (line_lsn != rec_lsn)
+				break;
+		}
+	}
+
+	if (nlines == buf->nlines)
+	{
+		if (!buf->eof)
+		{
+			fprintf(stderr, "max number of lines in record reached, LSN: %X/%08X\n",
+					line_xlogid, line_xrecoff);
+			exit(1);
+		}
+	}
+
+	/* Consume the lines from the reorder buffer */
+	memcpy(lines, buf->lines, sizeof(char *) * nlines);
+	memmove(&buf->lines[0], &buf->lines[nlines],
+			sizeof(char *) * (buf->nlines - nlines));
+	buf->nlines -= nlines;
+
+	fill_reorder_buffer(buf);
+	return nlines;
+}
+
+/*
+ * Free all the given records.
+ */
+static void
+freerecord(char **lines, int nlines)
+{
+	int                     i;
+
+	for (i = 0; i < nlines; i++)
+		pg_free(lines[i]);
+}
+
+/*
+ * Print out given records.
+ */
+static void
+printrecord(char **lines, int nlines)
+{
+	int			i;
+
+	for (i = 0; i < nlines; i++)
+		printf("%s", lines[i]);
+}
+
+/*
+ * Do a direct comparison between the two given records that have the
+ * same LSN entry.
+ */
+static bool
+diffrecord(char **lines_a, int nlines_a, char **lines_b, int nlines_b)
+{
+	int i;
+
+	/* Leave if they do not have the same number of lines */
+	if (nlines_a != nlines_b)
+		return true;
+
+	/*
+	 * Now do a comparison line by line. If any diffs are found at
+	 * character-level they will be printed out.
+	 */
+	for (i = 0; i < nlines_a; i++)
+	{
+		if (strcmp(lines_a[i], lines_b[i]) != 0)
+		{
+			int strlen_a = strlen(lines_a[i]);
+			int strlen_b = strlen(lines_b[i]);
+			int j;
+
+			printf("strlen_a: %d, strlen_b: %d\n",
+				   strlen_a, strlen_b);
+			for (j = 0; j < strlen_a && j < strlen_b; j++)
+			{
+				char char_a = lines_a[i][j];
+				char char_b = lines_b[i][j];
+				if (char_a != char_b)
+					printf("position: %d, char_a: %c, char_b: %c\n",
+						   j, char_a, char_b);
+			}
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void
+usage(void)
+{
+	printf("usage: buffer_capture_cmp <master's capture file> <standby's capture file>\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *lines_a[REORDER_BUFFER_SIZE];
+	int			nlines_a;
+	char	   *lines_b[REORDER_BUFFER_SIZE];
+	int			nlines_b;
+	char	   *path_a, *path_b;
+	FILE	   *fp_a;
+	FILE	   *fp_b;
+	uint64		lsn_a;
+	uint64		lsn_b;
+	reorder_buffer *buf_a;
+	reorder_buffer *buf_b;
+
+	if (argc != 3)
+	{
+		usage();
+		exit(1);
+	}
+
+	/* Open first file */
+	path_a = pg_strdup(argv[1]);
+	fp_a = fopen(path_a, "rb");
+	if (fp_a == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_a);
+		exit(1);
+	}
+
+	/* Open second file */
+	path_b = pg_strdup(argv[2]);
+	fp_b = fopen(path_b, "rb");
+	if (fp_b == NULL)
+	{
+		fprintf(stderr, "Could not open file \"%s\"\n", path_b);
+		exit(1);
+	}
+
+	/* Initialize buffers for first loop */
+	buf_a = init_reorder_buffer(fp_a);
+	buf_b = init_reorder_buffer(fp_b);
+
+	/* Read first record from both */
+	nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+	nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+
+	/* Do comparisons as long as there are entries */
+	while (nlines_a > 0 || nlines_b > 0)
+	{
+		/* Compare the records */
+		if (lsn_a < lsn_b || nlines_b == 0)
+		{
+			printf("Only in A:\n");
+			printrecord(lines_a, nlines_a);
+			freerecord(lines_a, nlines_a);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+		}
+		else if (lsn_a > lsn_b || nlines_a == 0)
+		{
+			printf("Only in B:\n");
+			printrecord(lines_b, nlines_b);
+			freerecord(lines_b, nlines_b);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+		else if (lsn_a == lsn_b)
+		{
+			if (diffrecord(lines_a, nlines_a, lines_b, nlines_b))
+			{
+				printf("Lines differ, A:\n");
+				printrecord(lines_a, nlines_a);
+				printf("B:\n");
+				printrecord(lines_b, nlines_b);
+			}
+			freerecord(lines_a, nlines_a);
+			freerecord(lines_b, nlines_b);
+			nlines_a = readrecord(buf_a, lines_a, &lsn_a);
+			nlines_b = readrecord(buf_b, lines_b, &lsn_b);
+		}
+	}
+
+	return 0;
+}
diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh
new file mode 100644
index 0000000..914c8a6
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test-default.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# Default test suite for buffer_capture_cmp
+
+# PGPORT is already set so process should refer to that when kicking tests
+
+# In order to run the regression test suite, copy this file as test-custom.sh
+# and then uncomment the following lines:
+# ROOT_DIR=$PWD
+# psql -c 'CREATE DATABASE regression'
+# cd ../../src/test/regress && make installcheck 2>&1 > /dev/null
+# cd $ROOT_DIR
+
+# Run a simple set of queries
+psql --no-psqlrc <<EOF
+CREATE TABLE tbtree AS SELECT generate_series(1, 10) AS a;
+CREATE INDEX i_tbtree ON tbtree USING btree(a);
+CREATE TABLE thash AS SELECT generate_series(1, 10) AS a;
+CREATE INDEX i_thash ON thash USING hash(a);
+CREATE TABLE tgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a;
+CREATE INDEX i_tgist ON tgist USING gist(p1);
+CREATE TABLE tspgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a;
+CREATE INDEX i_tspgist ON tspgist USING spgist(p1);
+CREATE TABLE tgin AS SELECT ARRAY[a/10, a%10] as a1 from generate_series(0, 10) a;
+CREATE INDEX i_tgin ON tgin USING gin(a1);
+CREATE SEQUENCE sq1;
+EOF
diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh
new file mode 100644
index 0000000..6563811
--- /dev/null
+++ b/contrib/buffer_capture_cmp/test.sh
@@ -0,0 +1,184 @@
+#!/bin/bash
+
+# contrib/buffer_capture_cmp/test.sh
+#
+# Test driver for buffer_capture_cmp. It does the following processing:
+#
+# 1) Initialization of a master and a standby
+# 2) Stop master, then standby
+# 3) Remove $PGDATA/buffer_capture on master and standby
+# 4) Start master, then standby
+# 5) Run custom or default series of tests
+# 6) Stop master, then standby
+# 7) Compare the buffer capture of both nodes
+# 8) The difference file should be empty
+#
+# Portions Copyright (c) 2006-2014, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+
+# Initialize environment
+. ../../src/test/shell/init_env.sh
+
+# Adjust these paths for your environment
+TESTROOT=$PWD/tmp_check
+TEST_MASTER=$TESTROOT/data_master
+TEST_STANDBY=$TESTROOT/data_standby
+
+# Create the root folder for test data
+if [ ! -d $TESTROOT ]; then
+	mkdir -p $TESTROOT
+fi
+
+export PGDATABASE="postgres"
+
+# Set up PATH correctly
+PATH=$bindir:$PATH
+export PATH
+
+# Get port values for master node
+pg_get_test_port ../..
+PG_MASTER_PORT=$PGPORT
+
+# Enable echo so the user can see what is being executed
+set -x
+
+# Fetch file name containing buffer captures
+CAPTURE_FILE_NAME=`grep '#define BUFFER_CAPTURE_FILE' "$PG_ROOT_DIR"/src/include/storage/bufcapt.h | awk '{print $3}' | sed 's/\"//g'`
+CAPTURE_FILE_MASTER=$TEST_MASTER/$CAPTURE_FILE_NAME
+CAPTURE_FILE_STANDBY=$TEST_STANDBY/$CAPTURE_FILE_NAME
+
+# Set up the nodes, first the master
+rm -rf $TEST_MASTER
+initdb -N -A trust -D $TEST_MASTER
+
+# Custom parameters for master's postgresql.conf
+cat >> $TEST_MASTER/postgresql.conf <<EOF
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+checkpoint_segments = 50
+shared_buffers = 1MB
+log_line_prefix = 'M  %m %p '
+hot_standby = on
+autovacuum = off
+max_connections = 50
+listen_addresses = '$LISTEN_ADDRESSES'
+port = $PG_MASTER_PORT
+EOF
+
+# Accept replication connections on master
+cat >> $TEST_MASTER/pg_hba.conf <<EOF
+local replication all trust
+host replication all 127.0.0.1/32 trust
+host replication all ::1/128 trust
+EOF
+
+# Start master
+pg_ctl -w -D $TEST_MASTER start
+
+# Now the standby
+echo "Master initialized and running."
+
+# Set up standby with necessary parameters
+rm -rf $TEST_STANDBY
+
+# Base backup is taken with xlog files included
+pg_basebackup -D $TEST_STANDBY -p $PG_MASTER_PORT -x
+
+# Get a fresh port value for the standby
+pg_get_test_port ../..
+PG_STANDBY_PORT=$PGPORT
+echo standby: $PG_STANDBY_PORT master: $PG_MASTER_PORT
+echo "port = $PG_STANDBY_PORT" >> $TEST_STANDBY/postgresql.conf
+
+# Still need to set up PGPORT for subsequent tests
+PGPORT=$PG_MASTER_PORT
+export PGPORT
+
+cat > $TEST_STANDBY/recovery.conf <<EOF
+primary_conninfo='port=$PG_MASTER_PORT'
+standby_mode=on
+recovery_target_timeline='latest'
+EOF
+
+# Start standby
+pg_ctl -w -D $TEST_STANDBY start
+
+# Stop both nodes and remove the file containing the buffer captures
+# Master needs to be stopped first.
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+rm -rf $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY
+
+# Re-start the nodes
+pg_ctl -w -D $TEST_MASTER start
+pg_ctl -w -D $TEST_STANDBY start
+
+# Check the presence of custom tests and kick them in priority. If not,
+# fallback to the default tests. Tests need only to be run on the master
+# node.
+if [ -f ./test-custom.sh ]; then
+	TEST_SCRIPT=test-custom.sh
+else
+	TEST_SCRIPT=test-default.sh
+fi
+
+# In case of an error from the test script, wait until the clusters have
+# been stopped before reporting anything; we do not want nodes still
+# running after this test, particularly if the user has done some stupid
+# things with a custom script.
+set +e
+${SHELL} -e $TEST_SCRIPT
+EXITSTATUS=$?
+set -e
+
+# Time to stop the nodes as tests have run
+pg_ctl -w -D $TEST_MASTER stop
+pg_ctl -w -D $TEST_STANDBY stop
+
+if [ $EXITSTATUS != 0 ]; then
+	echo "$TEST_SCRIPT exited with code $EXITSTATUS"
+	exit $EXITSTATUS;
+fi
+
+DIFF_FILE=capture_differences.txt
+
+# Check if the capture files exist. If not, build may have not been
+# done with BUFFER_CAPTURE enabled.
+if [ ! -f $CAPTURE_FILE_MASTER ]; then
+	echo "Capture file $CAPTURE_FILE_MASTER is missing on master"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS?"
+	exit 0
+fi
+if [ ! -f $CAPTURE_FILE_STANDBY ]; then
+	echo "Capture file $CAPTURE_FILE_STANDBY is missing on standby"
+	echo "Has build been done with -DBUFFER_CAPTURE included in CFLAGS?"
+	exit 0
+fi
+
+# Now compare the buffer images
+# Disable erroring here, buffer capture file may not be present
+# if cluster has not been built with symbol BUFFER_CAPTURE
+set +e
+./buffer_capture_cmp $CAPTURE_FILE_MASTER $CAPTURE_FILE_STANDBY > $DIFF_FILE
+ERR_NUM=$?
+
+# Leave on error
+if [ $ERR_NUM -eq 1 ]; then
+	echo "FAILED"
+	exit 1
+fi
+
+# No need to echo commands anymore
+set +x
+echo
+
+# Test passes if there are no diffs!
+if [ ! -s $DIFF_FILE ]; then
+	echo "PASSED"
+	exit 0
+else
+	echo "Diffs exist in the buffer captures"
+	echo "FAILED"
+	exit 1
+fi
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6861ae0..d43406c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,9 @@
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -7010,6 +7013,14 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 
 	END_CRIT_SECTION();
 
+#ifdef BUFFER_CAPTURE
+	/*
+	 * The normal mechanism that hooks into LockBuffer doesn't work for this,
+	 * because we're bypassing buffer manager.
+	 */
+	buffer_capture_write(page, blkno);
+#endif
+
 	return recptr;
 }
 
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..6ec85b0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufcapt.o bufmgr.o freelist.o localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufcapt.c b/src/backend/storage/buffer/bufcapt.c
new file mode 100644
index 0000000..57c8c6e
--- /dev/null
+++ b/src/backend/storage/buffer/bufcapt.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.c
+ *	  Routines for buffer capture, including masking and dynamic buffer
+ *	  snapshot.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufcapt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/gist.h"
+#include "access/gin_private.h"
+#include "access/htup_details.h"
+#include "access/spgist_private.h"
+#include "commands/sequence.h"
+#include "storage/bufcapt.h"
+#include "storage/bufmgr.h"
+#include "storage/lwlock.h"
+
+/* Marker used to mask pages consistently */
+#define MASK_MARKER				0xFF
+
+/* Support for capturing changes to pages per process */
+#define MAX_BEFORE_IMAGES		100
+
+typedef struct
+{
+	Buffer		  buffer;
+	char			content[BLCKSZ];
+} BufferImage;
+
+static BufferImage *before_images[MAX_BEFORE_IMAGES];
+static FILE *imagefp;
+static int before_images_cnt = 0;
+
+/* ----------------------------------------------------------------
+ * Masking functions.
+ *
+ * Most pages cannot be compared directly, because some parts of the
+ * page are not expected to be byte-by-byte identical. For example,
+ * hint bits or unused space in the page. The strategy is to normalize
+ * all pages by creating a mask of those bits that are not expected to
+ * match.
+ */
+static void mask_unused_space(char *page);
+static void mask_heap_page(char *page);
+static void mask_spgist_page(char *page);
+static void mask_gist_page(char *page);
+static void mask_gin_page(BlockNumber blkno, char *page);
+static void mask_sequence_page(char *page);
+static void mask_btree_page(char *page);
+
+/*
+ * Mask the unused space of a page between pd_lower and pd_upper.
+ */
+static void
+mask_unused_space(char *page)
+{
+	int		 pd_lower = ((PageHeader) page)->pd_lower;
+	int		 pd_upper = ((PageHeader) page)->pd_upper;
+	int		 pd_special = ((PageHeader) page)->pd_special;
+
+	/* Sanity check */
+	if (pd_lower > pd_upper || pd_special < pd_upper ||
+		pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
+	{
+		elog(ERROR, "invalid page at %X/%08X",
+				((PageHeader) page)->pd_lsn.xlogid,
+				((PageHeader) page)->pd_lsn.xrecoff);
+	}
+
+	memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower);
+}
+
+/*
+ * Mask a heap page
+ */
+static void
+mask_heap_page(char *page)
+{
+	OffsetNumber off;
+	PageHeader phdr = (PageHeader) page;
+
+	mask_unused_space(page);
+
+	/* Ignore prune_xid (it's like a hint-bit) */
+	phdr->pd_prune_xid = 0xFFFFFFFF;
+
+	/* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */
+	phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+
+	/*
+	 * Also mask the all-visible flag.
+	 *
+	 * XXX: It is unfortunate that we have to do this. If the flag is set
+	 * incorrectly, that's serious, and we would like to catch it. If the flag
+	 * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+	 * records doesn't currently set the flag, even though it is set in the
+	 * master, so we must silence the failures that this causes.
+	 */
+	phdr->pd_flags |= PD_ALL_VISIBLE;
+
+	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
+	{
+		ItemId	  iid = PageGetItemId(page, off);
+		char	   *page_item;
+
+		page_item = (char *) (page + ItemIdGetOffset(iid));
+
+		/*
+		 * Ignore hint bits and command ID.
+		 */
+		if (ItemIdIsNormal(iid))
+		{
+			HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
+
+			page_htup->t_infomask =
+				HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+				HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
+			page_htup->t_infomask |= HEAP_COMBOCID;
+			page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF;
+		}
+
+		/*
+		 * Ignore any padding bytes after the tuple, when the length of
+		 * the item is not MAXALIGNed.
+		 */
+		if (ItemIdHasStorage(iid))
+		{
+			int	 len = ItemIdGetLength(iid);
+			int	 padlen = MAXALIGN(len) - len;
+
+			if (padlen > 0)
+				memset(page_item + len, MASK_MARKER, padlen);
+		}
+	}
+}
+
+/*
+ * Mask a SpGist page
+ */
+static void
+mask_spgist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a GIST page
+ */
+static void
+mask_gist_page(char *page)
+{
+	mask_unused_space(page);
+}
+
+/*
+ * Mask a Gin page
+ */
+static void
+mask_gin_page(BlockNumber blkno, char *page)
+{
+	/* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */
+	if (blkno != 0)
+		mask_unused_space(page);
+}
+
+/*
+ * Mask a sequence page
+ */
+static void
+mask_sequence_page(char *page)
+{
+	/*
+	 * FIXME: currently, we just ignore sequence records altogether. nextval
+	 * records a different value in the WAL record than it writes to the
+	 * buffer. Ideally we would only mask out the value in the tuple.
+	 */
+	memset(page, MASK_MARKER, BLCKSZ);
+}
+
+/*
+ * Mask a btree page
+ */
+static void
+mask_btree_page(char *page)
+{
+	OffsetNumber off;
+	OffsetNumber maxoff;
+	BTPageOpaque maskopaq = (BTPageOpaque)
+		(((char *) page) + ((PageHeader) page)->pd_special);
+
+	/*
+	 * Mark unused space before any processing. This is important as it
+	 * uses pd_lower and pd_upper that may be masked on this page
+	 * afterwards if it is a deleted page.
+	 */
+	mask_unused_space(page);
+
+	/*
+	 * Mask everything on a DELETED page.
+	 */
+	if (((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags & BTP_DELETED)
+	{
+		/* Page content, between standard page header and opaque struct */
+		memset(page + SizeOfPageHeaderData, MASK_MARKER,
+			   BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)));
+
+		/* pd_lower and upper */
+		memset(&((PageHeader) page)->pd_lower, MASK_MARKER, sizeof(uint16));
+		memset(&((PageHeader) page)->pd_upper, MASK_MARKER, sizeof(uint16));
+	}
+	else
+	{
+		/*
+		 * Mask some line pointer bits, particularly those marked as
+		 * used on a master and unused on a standby.
+		 * XXX: This could be better with more refinement.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (off = 1; off <= maxoff; off++)
+		{
+			ItemId iid = PageGetItemId(page, off);
+
+			if (ItemIdIsUsed(iid))
+				iid->lp_flags = LP_UNUSED;
+		}
+	}
+
+	/*
+	 * Mask BTP_HAS_GARBAGE flag. This needs to be done at the end
+	 * of process as previous masking operations could generate some
+	 * garbage.
+	 */
+	maskopaq->btpo_flags |= BTP_HAS_GARBAGE;
+}
+
+/* ----------------------------------------------------------------
+ * Buffer capture functions.
+ *
+ * Those functions can be used to memorize the content of pages
+ * and flush them to BUFFER_CAPTURE_FILE when necessary.
+ */
+static bool
+lsn_is_updated(BufferImage *img)
+{
+	Page			newcontent = BufferGetPage(img->buffer);
+	Page			oldcontent = (Page) img->content;
+
+	if (PageGetLSN(oldcontent) == PageGetLSN(newcontent))
+		return false;
+	return true;
+}
+
+/*
+ * Initialize page capture
+ */
+void
+buffer_capture_init(void)
+{
+	int				i;
+	BufferImage	   *images;
+
+	/* Initialize page image capturing */
+	images = palloc(MAX_BEFORE_IMAGES * sizeof(BufferImage));
+
+	for (i = 0; i < MAX_BEFORE_IMAGES; i++)
+		before_images[i] = &images[i];
+
+	imagefp = fopen(BUFFER_CAPTURE_FILE, "ab");
+}
+
+/*
+ * buffer_capture_reset
+ *
+ * Reset buffer captures
+ */
+void
+buffer_capture_reset(void)
+{
+	if (before_images_cnt > 0)
+		elog(LOG, "Released all buffer captures");
+	before_images_cnt = 0;
+}
+
+/*
+ * buffer_capture_write
+ *
+ * Flush to file the new content page present here after applying a
+ * mask on it.
+ */
+void
+buffer_capture_write(char *newcontent,
+					 uint32 blkno)
+{
+	XLogRecPtr	newlsn = PageGetLSN((Page) newcontent);
+	char		page[BLCKSZ];
+	uint16		tail;
+
+	/*
+	 * We need a lock to make sure that only one backend writes to the file
+	 * at a time. Abuse SyncScanLock for that - it happens to never be used
+	 * while a buffer is locked/unlocked.
+	 */
+	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
+
+	/* Copy content of page before any operation */
+	memcpy(page, newcontent, BLCKSZ);
+
+	/*
+	 * Look at the size of the special area, and the last two bytes in
+	 * it, to detect what kind of a page it is. Then call the appropriate
+	 * masking function.
+	 */
+	memcpy(&tail, &page[BLCKSZ - 2], 2);
+	if (PageGetSpecialSize(page) == 0)
+	{
+		/* Case of a normal relation, it has an empty special area */
+		mask_heap_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
+			 tail == GIST_PAGE_ID)
+	{
+		/* Gist page */
+		mask_gist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(BTPageOpaqueData)) &&
+			 tail <= MAX_BT_CYCLE_ID)
+	{
+		/* btree page */
+		mask_btree_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(SpGistPageOpaqueData)) &&
+			 tail == SPGIST_PAGE_ID)
+	{
+		/* SpGist page */
+		mask_spgist_page(page);
+	}
+	else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GinPageOpaqueData)) ||
+			 PageGetSpecialSize(page) == MAXALIGN(sizeof(SequencePageOpaqueData)))
+	{
+		/*
+		 * The page found here is used either for a Gin index or a sequence.
+		 * Gin index pages do not have a proper identifier, so check if the page
+		 * is used by a sequence or not. If it is not the case, this page is used
+		 * by a gin index. It is still possible that a gin page holds, in that
+		 * area, exactly the same value as SEQ_MAGIC, but this is unlikely to happen.
+		 */
+		if (((SequencePageOpaqueData *) PageGetSpecialPointer(page))->seq_page_id == SEQ_MAGIC)
+			mask_sequence_page(page);
+		else
+			mask_gin_page(blkno, page);
+	}
+	else
+	{
+		/* Should not come here */
+		Assert(0);
+	}
+
+	/*
+	 * First write the LSN in a constant format to facilitate comparisons
+	 * between buffer captures.
+	 */
+	fprintf(imagefp, "LSN: %08X/%08X ",
+			(uint32) (newlsn >> 32), (uint32) newlsn);
+
+	/* Then write the page contents, in hex */
+	fprintf(imagefp, "page: ");
+	{
+		char	buf[BLCKSZ * 2];
+		int     j = 0;
+		int		i;
+		for (i = 0; i < BLCKSZ; i++)
+		{
+			const char *digits = "0123456789ABCDEF";
+			uint8 byte = (uint8) page[i];
+
+			buf[j++] = digits[byte >> 4];
+			buf[j++] = digits[byte & 0x0F];
+		}
+		fwrite(buf, BLCKSZ * 2, 1, imagefp);
+	}
+	fprintf(imagefp, "\n");
+
+	/* Flush the entry so that it reaches the capture file */
+	fflush(imagefp);
+
+	/* Clean up */
+	LWLockRelease(SyncScanLock);
+}
+
+/*
+ * buffer_capture_remember
+ *
+ * Append a page content to the existing list of buffers to-be-captured.
+ */
+void
+buffer_capture_remember(Buffer buffer)
+{
+	BufferImage *img;
+
+	Assert(before_images_cnt < MAX_BEFORE_IMAGES);
+
+	img = before_images[before_images_cnt];
+	img->buffer = buffer;
+	memcpy(img->content, BufferGetPage(buffer), BLCKSZ);
+	before_images_cnt++;
+}
+
+/*
+ * buffer_capture_forget
+ *
+ * Forget a page image. If the page was modified, log the new contents.
+ */
+void
+buffer_capture_forget(Buffer buffer)
+{
+	int			i;
+	RelFileNode	rnode;
+	ForkNumber	forknum;
+	BlockNumber	blkno;
+	Page		content;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		if (img->buffer == buffer)
+		{
+			/* If page has new content, capture it */
+			if (lsn_is_updated(img))
+			{
+				content = BufferGetPage(img->buffer);
+				BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+				buffer_capture_write(content, blkno);
+			}
+
+			if (i != before_images_cnt - 1)
+			{
+				/* Swap the last still-used slot with this one */
+				before_images[i] = before_images[before_images_cnt - 1];
+				before_images[before_images_cnt - 1] = img;
+			}
+
+			before_images_cnt--;
+			return;
+		}
+	}
+
+	/* Gather some information about this buffer image not found */
+	content = BufferGetPage(buffer);
+	BufferGetTag(buffer, &rnode, &forknum, &blkno);
+	elog(LOG, "could not find image for buffer %d: LSN %X/%08X rnode %u, "
+			  "forknum %u, blkno %u",
+		 buffer, ((PageHeader) content)->pd_lsn.xlogid,
+		 ((PageHeader) content)->pd_lsn.xrecoff,
+		 rnode.relNode, forknum, blkno);
+}
+
+/*
+ * buffer_capture_scan
+ *
+ * See if any of the buffers that have been memorized have changed and
+ * update them if it is the case.
+ */
+void
+buffer_capture_scan(void)
+{
+	int i;
+
+	for (i = 0; i < before_images_cnt; i++)
+	{
+		BufferImage *img = before_images[i];
+
+		/*
+		 * Print the contents of the page if it was changed. Remember the
+		 * new contents as the current image.
+		 */
+		if (lsn_is_updated(img))
+		{
+			Page content = BufferGetPage(img->buffer);
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+
+			BufferGetTag(img->buffer, &rnode, &forknum, &blkno);
+			buffer_capture_write(content, blkno);
+			memcpy(img->content, BufferGetPage(img->buffer), BLCKSZ);
+		}
+	}
+}
+
+#endif /* BUFFER_CAPTURE */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..4624a99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -50,6 +50,9 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -1708,6 +1711,10 @@ AtEOXact_Buffers(bool isCommit)
 {
 	CheckForBufferLeaks();
 
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	AtEOXact_LocalBuffers(isCommit);
 }
 
@@ -1724,6 +1731,10 @@ void
 InitBufferPoolBackend(void)
 {
 	on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef BUFFER_CAPTURE
+	buffer_capture_init();
+#endif
 }
 
 /*
@@ -2749,6 +2760,20 @@ LockBuffer(Buffer buffer, int mode)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_UNLOCK)
+	{
+		/*
+		 * Peek into the LWLock struct to see if we're holding it in exclusive
+		 * or shared mode. This is concurrency-safe: if we're holding it in
+		 * exclusive mode, no-one else can release it. If we're holding it
+		 * in shared mode, no-one else can acquire it in exclusive mode.
+		 */
+		if (buf->content_lock->exclusive > 0)
+			buffer_capture_forget(buffer);
+	}
+#endif
+
 	if (mode == BUFFER_LOCK_UNLOCK)
 		LWLockRelease(buf->content_lock);
 	else if (mode == BUFFER_LOCK_SHARE)
@@ -2757,6 +2782,11 @@ LockBuffer(Buffer buffer, int mode)
 		LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+
+#ifdef BUFFER_CAPTURE
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		buffer_capture_remember(buffer);
+#endif
 }
 
 /*
@@ -2768,6 +2798,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	volatile BufferDesc *buf;
+	bool	acquired;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
@@ -2775,7 +2806,14 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = &(BufferDescriptors[buffer - 1]);
 
-	return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+	acquired = LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+
+#ifdef BUFFER_CAPTURE
+	if (acquired)
+		buffer_capture_remember(buffer);
+#endif
+
+	return acquired;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a62af27..4671a55 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -41,6 +41,10 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#ifdef BUFFER_CAPTURE
+#include "storage/bufcapt.h"
+#endif
+
 #ifdef LWLOCK_STATS
 #include "utils/hsearch.h"
 #endif
@@ -1272,6 +1276,10 @@ LWLockRelease(LWLock *l)
 void
 LWLockReleaseAll(void)
 {
+#ifdef BUFFER_CAPTURE
+	buffer_capture_reset();
+#endif
+
 	while (num_held_lwlocks > 0)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..8b3b83c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -21,7 +21,6 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
-
 /* GUC variable */
 bool		ignore_checksum_failure = false;
 
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 2455878..cbd6780 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -27,7 +27,7 @@ typedef struct SequencePageOpaqueData
 } SequencePageOpaqueData;
 
 /*
- * This page ID is for the conveniende to be able to identify if a page
+ * This page ID is for the convenience to be able to identify if a page
  * is being used by a sequence.
  */
 #define SEQ_MAGIC		0x1717
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..1ae98f7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -116,11 +116,24 @@ do { \
 
 #define START_CRIT_SECTION()  (CritSectionCount++)
 
+#ifdef BUFFER_CAPTURE
+/* in src/backend/storage/buffer/bufcapt.c */
+void buffer_capture_scan(void);
+
+#define END_CRIT_SECTION() \
+do { \
+	Assert(CritSectionCount > 0); \
+	CritSectionCount--; \
+	if (CritSectionCount == 0) \
+		buffer_capture_scan(); \
+} while(0)
+#else
 #define END_CRIT_SECTION() \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
 } while(0)
+#endif
 
 
 /*****************************************************************************
diff --git a/src/include/storage/bufcapt.h b/src/include/storage/bufcapt.h
new file mode 100644
index 0000000..012e5eb
--- /dev/null
+++ b/src/include/storage/bufcapt.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufcapt.h
+ *	  Buffer capture definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufcapt.h
+ *
+ * About BUFFER_CAPTURE:
+ *
+ * If this symbol is defined, all page images that are logged on this
+ * server are also flushed to BUFFER_CAPTURE_FILE. One line of the
+ * output file is used for a single page image.
+ *
+ * The page images obtained are meant to be used with the utility tool
+ * called buffer_capture_cmp available in contrib/, which compares how
+ * WAL is replayed between master and standby nodes and so helps in
+ * spotting bugs in this area. As each page is written in hexadecimal
+ * format, one single page writes BLCKSZ * 2 bytes to the capture file.
+ * Hexadecimal format makes it easier to spot differences between captures
+ * done among nodes. Be aware that this has a large impact on I/O and that
+ * it is intended only for buildfarm and development purposes.
+ *
+ * One single page entry has the following format:
+ *	LSN: %08X/%08X page: PAGE_IN_HEXA
+ *
+ * The LSN is the log sequence number to which the page image applies,
+ * then the masked content of the image is added in hexadecimal. This format
+ * is chosen to facilitate comparisons between capture entries,
+ * particularly when the number of digits in the LSN increases. As unlogged
+ * relations do not have LSN numbers saved, their buffer modifications are
+ * not captured by this facility.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFCAPT_H
+#define BUFCAPT_H
+
+#ifdef BUFFER_CAPTURE
+
+#include "postgres.h"
+
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Output file where buffer captures are stored
+ */
+#define BUFFER_CAPTURE_FILE "buffer_captures"
+
+void buffer_capture_init(void);
+void buffer_capture_reset(void);
+void buffer_capture_write(char *newcontent,
+						  uint32 blkno);
+
+void buffer_capture_remember(Buffer buffer);
+void buffer_capture_forget(Buffer buffer);
+void buffer_capture_scan(void);
+
+#endif /* BUFFER_CAPTURE */
+
+#endif
-- 
2.0.1
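
For readers skimming the patch, the core of buffer_capture_write (mask the parts of a page that are not expected to match, then emit one "LSN: ... page: ..." line in hex) can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the patch's code: DemoPageHeader, DEMO_PAGE_SIZE, demo_mask_unused_space and demo_format_entry are hypothetical stand-ins for PageHeader (pd_lower/pd_upper), BLCKSZ and the real functions, and only the unused-space mask is shown.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_PAGE_SIZE		32		/* hypothetical stand-in for BLCKSZ */
#define DEMO_MASK_MARKER	0xFF	/* same marker value as the patch */

/* Hypothetical, simplified page header: only the two offsets used here */
typedef struct
{
	uint16_t	lower;		/* end of the line pointer array */
	uint16_t	upper;		/* start of the tuple data */
} DemoPageHeader;

/* Overwrite the hole between lower and upper; returns -1 on a bad header */
static int
demo_mask_unused_space(unsigned char *page)
{
	DemoPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));
	if ((size_t) hdr.lower < sizeof(hdr) || hdr.lower > hdr.upper ||
		hdr.upper > DEMO_PAGE_SIZE)
		return -1;
	memset(page + hdr.lower, DEMO_MASK_MARKER,
		   (size_t) (hdr.upper - hdr.lower));
	return 0;
}

/*
 * Mask a copy of the page, then format it as one capture entry:
 * "LSN: XXXXXXXX/XXXXXXXX page: <hex>\n". Returns the entry length, or 0
 * if the header is invalid or the output buffer is too small.
 */
static size_t
demo_format_entry(uint64_t lsn, const unsigned char *page,
				  char *out, size_t out_size)
{
	static const char digits[] = "0123456789ABCDEF";
	unsigned char masked[DEMO_PAGE_SIZE];
	size_t		pos;
	size_t		i;
	int			n;

	memcpy(masked, page, DEMO_PAGE_SIZE);
	if (demo_mask_unused_space(masked) != 0)
		return 0;

	n = snprintf(out, out_size, "LSN: %08" PRIX32 "/%08" PRIX32 " page: ",
				 (uint32_t) (lsn >> 32), (uint32_t) lsn);
	if (n < 0 || (size_t) n >= out_size)
		return 0;
	pos = (size_t) n;

	/* Two hex digits per byte, plus the newline and the terminating NUL */
	if (out_size - pos < (size_t) DEMO_PAGE_SIZE * 2 + 2)
		return 0;
	for (i = 0; i < DEMO_PAGE_SIZE; i++)
	{
		out[pos++] = digits[masked[i] >> 4];
		out[pos++] = digits[masked[i] & 0x0F];
	}
	out[pos++] = '\n';
	out[pos] = '\0';
	return pos;
}
```

With this shape, two pages that differ only inside the hole produce identical entries, so a plain textual comparison of the master and standby capture lines reports a difference only when bytes outside the masked regions differ.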

#36

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Michael Paquier (#35)

Re: WAL replay bugs

Hello,

Although I doubt that this flexibility is necessary given the current
testing framework, I have no strong objection to it. Nevertheless,
perhaps it would be good to put a notice in a README or somewhere.

Hm, well... Fine, I added it in this updated series.

Thank you for your patience:)

I have no more comments about this patch. I will mark it as
'Ready for committer' if no comments come from others in the next
few days.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 11 years ago

In reply to: Kyotaro HORIGUCHI (#36)

Re: WAL replay bugs

Hi, I'm not entirely sure whether this is the right time to do this...

I am marking this as 'Ready for Committer' since nobody else has raised
additional comments or objections on this patch.

I have no more comments about this patch. I will mark it as
'Ready for committer' if no comments come from others in the next
few days.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#38

Alvaro Herrera

alvherre@2ndquadrant.com

about 11 years ago

In reply to: Michael Paquier (#35)

Re: WAL replay bugs

Michael Paquier wrote:

On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Although I doubt that this flexibility is necessary given the current
testing framework, I have no strong objection to it. Nevertheless,
perhaps it would be good to put a notice in a README or somewhere.

Hm, well... Fine, I added it in this updated series.

Did this go anywhere?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#39

Alvaro Herrera

alvherre@2ndquadrant.com

about 11 years ago

In reply to: Michael Paquier (#35)

Re: WAL replay bugs

Michael Paquier wrote:

On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Although I doubt that this flexibility is necessary given the current
testing framework, I have no strong objection to it. Nevertheless,
perhaps it would be good to put a notice in a README or somewhere.

Hm, well... Fine, I added it in this updated series.

FWIW I gave this a trial run and found I needed some tweaks to test.sh
and the Makefile in order to make it work on VPATH; mainly replace ./
with `dirname $0` in a couple of places in test.sh, and
something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere
which doesn't work.

Also you have the Makefile checking for -DBUFFER_CAPTURE exactly but for
some reason I used -DBUFFER_CAPTURE=1 which wasn't well received by your
$(filter) stuff. Instead of checking CFLAGS it might make more sense to
expose it as a read-only GUC and grep `postmaster -C buffer_capture` or
similar.
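
To illustrate the point about the flag spelling: the C preprocessor treats -DBUFFER_CAPTURE and -DBUFFER_CAPTURE=1 the same way for #ifdef, so a check that goes through the compiled code rather than through the text of CFLAGS does not care which one was used. A minimal sketch, where buffer_capture_setting() is a hypothetical helper that is not part of the patch, showing the kind of value a read-only setting could report:

```c
/*
 * Hypothetical helper, not part of the patch: reports whether this
 * translation unit was compiled with BUFFER_CAPTURE defined. The result
 * is the same for -DBUFFER_CAPTURE and -DBUFFER_CAPTURE=1, because
 * #ifdef only tests whether the symbol is defined at all.
 */
static const char *
buffer_capture_setting(void)
{
#ifdef BUFFER_CAPTURE
	return "on";
#else
	return "off";
#endif
}
```

A test script could then look for "on" in the reported value instead of filtering CFLAGS for an exact string.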

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#40

Peter Eisentraut

peter_e@gmx.net

about 11 years ago

In reply to: Alvaro Herrera (#39)

Re: WAL replay bugs

On 11/4/14 3:21 PM, Alvaro Herrera wrote:

FWIW I gave this a trial run and found I needed some tweaks to test.sh
and the Makefile in order to make it work on VPATH; mainly replace ./
with `dirname $0` in a couple of places in test.sh, and
something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere
which doesn't work.

I also saw some bashisms in the script.

Maybe the time for shell-based test scripts has passed?

#41

Michael Paquier

michael.paquier@gmail.com

about 11 years ago

In reply to: Alvaro Herrera (#39)

Re: WAL replay bugs

Thanks for the tests.

On Wed, Nov 5, 2014 at 5:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Michael Paquier wrote:

On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Although I doubt that this flexibility is necessary given the current
testing framework, I have no strong objection to it. Nevertheless,
perhaps it would be good to put a notice in a README or somewhere.

Hm, well... Fine, I added it in this updated series.

FWIW I gave this a trial run and found I needed some tweaks to test.sh
and the Makefile in order to make it work on VPATH; mainly replace ./
with `dirname $0` in a couple of places in test.sh, and
something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere
which doesn't work.

Ah thanks, forgot that.

Also you have the Makefile checking for -DBUFFER_CAPTURE exactly but for
some reason I used -DBUFFER_CAPTURE=1 which wasn't well received by your
$(filter) stuff. Instead of checking CFLAGS it might make more sense to
expose it as a read-only GUC and grep `postmaster -C buffer_capture` or
similar.

Yes that's a good idea.

Now, do we really want this feature in-core? That's somewhat a duplicate of
what is mentioned here:
/messages/by-id/CAB7nPqQMq=4eJAK317mxZ4Has0i+1rSLBQU29zx18JwLB2j1OA@mail.gmail.com
Of course both things do not have the same coverage as the former is for
buildfarm and dev, while the latter is dedicated to production systems, but
could be used for development as well.

The patch sent there is a bit outdated, but a potential implementation gets
simpler with XLogReadBufferForRedo able to return flags about each block
state during redo. I am still planning to come back to it for this cycle,
though I have stopped for now, waiting for the WAL format patches to finish
shaping the APIs this feature would rely on.
Regards,
--
Michael

#42

Michael Paquier

michael.paquier@gmail.com

about 11 years ago

In reply to: Peter Eisentraut (#40)

Re: WAL replay bugs

On Wed, Nov 5, 2014 at 6:29 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 11/4/14 3:21 PM, Alvaro Herrera wrote:

FWIW I gave this a trial run and found I needed some tweaks to test.sh
and the Makefile in order to make it work on VPATH; mainly replace ./
with `dirname $0` in a couple of places in test.sh, and
something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere
which doesn't work.

I also saw some bashisms in the script.

Maybe the time for shell-based test scripts has passed?

Except pg_upgrade, are there other tests using bash?
--
Michael

#43

Alvaro Herrera

alvherre@2ndquadrant.com

about 11 years ago

In reply to: Michael Paquier (#41)

Re: WAL replay bugs

Michael Paquier wrote:

Now, do we really want this feature in-core? That's somewhat a duplicate of
what is mentioned here:
/messages/by-id/CAB7nPqQMq=4eJAK317mxZ4Has0i+1rSLBQU29zx18JwLB2j1OA@mail.gmail.com
Of course both things do not have the same coverage as the former is for
buildfarm and dev, while the latter is dedicated to production systems, but
could be used for development as well.

Oh, I had forgotten that other patch.

The patch sent there is a bit outdated, but a potential implementation gets
simpler with XLogReadBufferForRedo able to return flags about each block
state during redo. I am still planning to come back to it for this cycle,
though I have stopped for now, waiting for the WAL format patches to finish
shaping the APIs this feature would rely on.

I agree it makes sense to wait until the WAL reworks are done -- glad
to hear you're putting some time in this area.

Thanks,

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#44

Peter Eisentraut

peter_e@gmx.net

about 11 years ago

In reply to: Michael Paquier (#42)

Re: WAL replay bugs

On 11/4/14 10:50 PM, Michael Paquier wrote:

Except pg_upgrade, are there other tests using bash?

There are a few obscure things under src/test/.

#45

Michael Paquier

michael.paquier@gmail.com

about 11 years ago

In reply to: Peter Eisentraut (#44)

Re: WAL replay bugs

On Thu, Nov 6, 2014 at 5:41 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 11/4/14 10:50 PM, Michael Paquier wrote:

Except pg_upgrade, are there other tests using bash?

There are a few obscure things under src/test/.

Oh, I see. There are quite a number here, and each script does quite
different things.
$ git grep "/sh" src/test/
src/test/locale/de_DE.ISO8859-1/runall:#! /bin/sh
src/test/locale/gr_GR.ISO8859-7/runall:#! /bin/sh
src/test/locale/koi8-r/runall:#! /bin/sh
src/test/locale/koi8-to-win1251/runall:#! /bin/sh
src/test/mb/mbregress.sh:#! /bin/sh
src/test/performance/start-pgsql.sh:#!/bin/sh
src/test/regress/regressplans.sh:#! /bin/sh
--
Michael

#46

Alvaro Herrera

alvherre@2ndquadrant.com

over 10 years ago

In reply to: Michael Paquier (#35)

Re: WAL replay bugs

Michael Paquier wrote:

From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 19 Jul 2014 10:40:20 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

Is there a newer version of this tech?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#47

Michael Paquier

michael.paquier@gmail.com

over 10 years ago

In reply to: Alvaro Herrera (#46)

Re: WAL replay bugs

On Thu, Jun 18, 2015 at 3:39 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:

From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 19 Jul 2014 10:40:20 +0900
Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

Is there a newer version of this tech?

Not yet.
--
Michael
