pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

Started by Pavan Deolaseealmost 10 years ago15 messages
#1Pavan Deolasee
pavan.deolasee@gmail.com
2 attachment(s)

Hello,

While investigating some issue, I found that pg_xlogdump fails to dump
contents from a WAL file if the file has continuation data from previous
WAL record and the data spans more than one page. In such cases,
XLogFindNextRecord() fails to take into account that there will be more
than one xlog page headers (long and short) and thus tries to read from an
offset where no valid record exists. That results in pg_xlogdump throwing
error such as:

pg_xlogdump: FATAL: could not find a valid record after 0/46000000

Attached WAL file from master branch demonstrates the issue, generated
using synthetic data. Also, attached patch fixes it for me.

While we could have deduced the number of short and long headers and
skipped directly to the offset, I found reading one page at a time and
using XLogPageHeaderSize() to find header size of each page separately, a
much cleaner way. Also, the continuation data is not going to span many
pages. So I don't see any harm in doing it that way.

I encountered this on 9.3, but the patch applies to both 9.3 and master. I
haven't tested it on other branches, but I have no reason to believe it
won't apply or work. I believe we should back patch it all supported
branches.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

xlogdump_multipage_cont_record.patchapplication/octet-stream; name=xlogdump_multipage_cont_record.patchDownload
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..ad3ec3f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -855,35 +855,70 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
-	targetRecOff = RecPtr % XLOG_BLCKSZ;
+	/*
+	 * skip over potential continuation data, keeping in mind that it may span
+	 * multiple pages
+	 */
+	tmpRecPtr = RecPtr;
+	while (true)
+	{
+		/*
+		 * Compute targetRecOff. It should typically be greater than short
+		 * page-header since a valid record can't , but can also be zero when
+		 * caller has supplied a page-aligned address or when we are skipping
+		 * multi-page continuation record. It doesn't matter though because
+		 * ReadPageInternal() will read at least short page-header worth of
+		 * data
+		 */
+		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
 
-	/* scroll back to page boundary */
-	targetPagePtr = RecPtr - targetRecOff;
+		/* scroll back to page boundary */
+		targetPagePtr = tmpRecPtr - targetRecOff;
 
-	/* Read the page containing the record */
-	readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
-	if (readLen < 0)
-		goto err;
+		/* Read the page containing the record */
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		if (readLen < 0)
+			goto err;
 
-	header = (XLogPageHeader) state->readBuf;
+		header = (XLogPageHeader) state->readBuf;
 
-	pageHeaderSize = XLogPageHeaderSize(header);
+		pageHeaderSize = XLogPageHeaderSize(header);
 
-	/* make sure we have enough data for the page header */
-	readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
-	if (readLen < 0)
-		goto err;
+		/* make sure we have enough data for the page header */
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		if (readLen < 0)
+			goto err;
 
-	/* skip over potential continuation data */
-	if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
-	{
-		/* record headers are MAXALIGN'ed */
-		tmpRecPtr = targetPagePtr + pageHeaderSize
-			+ MAXALIGN(header->xlp_rem_len);
-	}
-	else
-	{
-		tmpRecPtr = targetPagePtr + pageHeaderSize;
+		/* skip over potential continuation data */
+		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+		{
+			/*
+			 * If the length of the remaining continuation data is more than
+			 * what can fit in this page, the continuation record crosses over
+			 * this page. Read the next page and try again. xlp_rem_len in the
+			 * next page header will contain the remaining length of the
+			 * continuation data
+			 *
+			 * Note that record headers are MAXALIGN'ed
+			 */
+			if (MAXALIGN(header->xlp_rem_len) > (XLOG_BLCKSZ - pageHeaderSize))
+				tmpRecPtr = targetPagePtr + XLOG_BLCKSZ;
+			else
+			{
+				/*
+				 * The previous continuation record ends in this page. Set
+				 * tmpRecPtr to point to the first valid record
+				 */
+				tmpRecPtr = targetPagePtr + pageHeaderSize
+					+ MAXALIGN(header->xlp_rem_len);
+				break;
+			}
+		}
+		else
+		{
+			tmpRecPtr = targetPagePtr + pageHeaderSize;
+			break;
+		}
 	}
 
 	/*
000000010000000000000046.gzapplication/x-gzip; name=000000010000000000000046.gzDownload
#2Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#1)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Wed, Mar 23, 2016 at 3:43 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

While investigating some issue, I found that pg_xlogdump fails to dump
contents from a WAL file if the file has continuation data from previous WAL
record and the data spans more than one page. In such cases,
XLogFindNextRecord() fails to take into account that there will be more than
one xlog page headers (long and short) and thus tries to read from an offset
where no valid record exists. That results in pg_xlogdump throwing error
such as:

OK. That makes sense, that is indeed a possible scenario.

Attached WAL file from master branch demonstrates the issue, generated using
synthetic data. Also, attached patch fixes it for me.

So does it for me.

While we could have deduced the number of short and long headers and skipped
directly to the offset, I found reading one page at a time and using
XLogPageHeaderSize() to find header size of each page separately, a much
cleaner way. Also, the continuation data is not going to span many pages. So
I don't see any harm in doing it that way.

I have to agree, the new code is pretty clean this way by calculating
the next page LSN directly in the loop should the record be longer
than it.

I encountered this on 9.3, but the patch applies to both 9.3 and master. I
haven't tested it on other branches, but I have no reason to believe it
won't apply or work. I believe we should back patch it all supported
branches.

Agreed, that's a bug, and the logic of your patch looks good to me.

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero when
+        * caller has supplied a page-aligned address or when we are skipping
+        * multi-page continuation record. It doesn't matter though because
+        * ReadPageInternal() will read at least short page-header worth of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.
+       targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though because
+        * ReadPageInternal() will read at least short page-header worth of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

xlogdump_multipage_cont_record_v2.patchapplication/octet-stream; name=xlogdump_multipage_cont_record_v2.patchDownload
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..dcc1dd3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -844,46 +844,83 @@ XLogRecPtr
 XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 {
 	XLogReaderState saved_state = *state;
-	XLogRecPtr	targetPagePtr;
 	XLogRecPtr	tmpRecPtr;
-	int			targetRecOff;
 	XLogRecPtr	found = InvalidXLogRecPtr;
-	uint32		pageHeaderSize;
 	XLogPageHeader header;
-	int			readLen;
 	char	   *errormsg;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
-	targetRecOff = RecPtr % XLOG_BLCKSZ;
+	/*
+	 * skip over potential continuation data, keeping in mind that it may span
+	 * multiple pages
+	 */
+	tmpRecPtr = RecPtr;
+	while (true)
+	{
+		XLogRecPtr	targetPagePtr;
+		int			targetRecOff;
+		uint32		pageHeaderSize;
+		int			readLen;
 
-	/* scroll back to page boundary */
-	targetPagePtr = RecPtr - targetRecOff;
+		/*
+		 * Compute targetRecOff. It should typically be equal or greater than
+		 * short page-header since a valid record can't start anywhere before
+		 * that, except when caller has explicitly specified the offset that
+		 * falls somewhere there or when we are skipping multi-page
+		 * continuation record. It doesn't matter though because
+		 * ReadPageInternal() is prepared to handle that and will read at least
+		 * short page-header worth of data
+		 */
+		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
 
-	/* Read the page containing the record */
-	readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
-	if (readLen < 0)
-		goto err;
+		/* scroll back to page boundary */
+		targetPagePtr = tmpRecPtr - targetRecOff;
 
-	header = (XLogPageHeader) state->readBuf;
+		/* Read the page containing the record */
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		if (readLen < 0)
+			goto err;
 
-	pageHeaderSize = XLogPageHeaderSize(header);
+		header = (XLogPageHeader) state->readBuf;
 
-	/* make sure we have enough data for the page header */
-	readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
-	if (readLen < 0)
-		goto err;
+		pageHeaderSize = XLogPageHeaderSize(header);
 
-	/* skip over potential continuation data */
-	if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
-	{
-		/* record headers are MAXALIGN'ed */
-		tmpRecPtr = targetPagePtr + pageHeaderSize
-			+ MAXALIGN(header->xlp_rem_len);
-	}
-	else
-	{
-		tmpRecPtr = targetPagePtr + pageHeaderSize;
+		/* make sure we have enough data for the page header */
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		if (readLen < 0)
+			goto err;
+
+		/* skip over potential continuation data */
+		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+		{
+			/*
+			 * If the length of the remaining continuation data is more than
+			 * what can fit in this page, the continuation record crosses over
+			 * this page. Read the next page and try again. xlp_rem_len in the
+			 * next page header will contain the remaining length of the
+			 * continuation data
+			 *
+			 * Note that record headers are MAXALIGN'ed
+			 */
+			if (MAXALIGN(header->xlp_rem_len) > (XLOG_BLCKSZ - pageHeaderSize))
+				tmpRecPtr = targetPagePtr + XLOG_BLCKSZ;
+			else
+			{
+				/*
+				 * The previous continuation record ends in this page. Set
+				 * tmpRecPtr to point to the first valid record
+				 */
+				tmpRecPtr = targetPagePtr + pageHeaderSize
+					+ MAXALIGN(header->xlp_rem_len);
+				break;
+			}
+		}
+		else
+		{
+			tmpRecPtr = targetPagePtr + pageHeaderSize;
+			break;
+		}
 	}
 
 	/*
#4Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#3)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Wed, Mar 23, 2016 at 7:04 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though
because
+        * ReadPageInternal() will read at least short page-header worth
of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

I'd just add dots at the end of the sentences in the comment blocks
because that's project style, but I'm being picky, except that the
logic looks quite good.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Wed, Mar 23, 2016 at 6:16 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I'd just add dots at the end of the sentences in the comment blocks
because that's project style, but I'm being picky, except that the
logic looks quite good.

Since this is a bug affecting all stable branches, IMHO it will be a good
idea to fix this before the upcoming minor releases.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#5)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 2016-03-28 13:21:41 +0530, Pavan Deolasee wrote:

On Wed, Mar 23, 2016 at 6:16 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I'd just add dots at the end of the sentences in the comment blocks
because that's project style, but I'm being picky, except that the
logic looks quite good.

Since this is a bug affecting all stable branches, IMHO it will be a good
idea to fix this before the upcoming minor releases.

It's definitely too late for that; they're going to be wrapped in a
couple hours.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#6)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Mon, Mar 28, 2016 at 10:15 PM, Andres Freund <andres@anarazel.de> wrote:

It's definitely too late for that; they're going to be wrapped in a
couple hours.

I have added this patch to the next CF so as we do not lose track of this bug:
https://commitfest.postgresql.org/10/593/
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Craig Ringer
craig@2ndquadrant.com
In reply to: Pavan Deolasee (#3)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 23 March 2016 at 18:04, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <
michael.paquier@gmail.com> wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though
because
+        * ReadPageInternal() will read at least short page-header worth
of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

Can you describe the process used to generate the sample WAL segment?

I'd like to turn it into a TAP test to go along with the patch.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Craig Ringer
craig@2ndquadrant.com
In reply to: Pavan Deolasee (#3)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 23 March 2016 at 18:04, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <
michael.paquier@gmail.com> wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though
because
+        * ReadPageInternal() will read at least short page-header worth
of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

Looks sensible to me based on a reading of "git diff -w" of the applied
patch. It passes make check and make -C src/test/recovery check . Marked
ready for committer; I'd like to add a TAP test for it, but it's ready to
go without that.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Craig Ringer (#8)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Thu, Mar 31, 2016 at 6:27 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

Can you describe the process used to generate the sample WAL segment?

Shame that I can't find the sql file used to create the problematic WAL
segment. But this is what I did.

I wrote a plpgsql function which inserts rows in a loop, every time
checking if the remaining space in the WAL segment has fallen to less than
couple of hundred bytes. Of course, I used pg_settings to get the WAL
segment size, WAL page size and pg_current_xlog_insert_location() to
correctly compute remaining bytes in the WAL segment. At this point, do a
non-HOT update, preferably to table which is already fully vacuumed and
CHECKPOINTed to avoid getting any other WAL records in between. Assuming
FPW is turned ON, the UPDATE should generate enough WAL to cross over the
first page in the new WAL segment.

Let me know if you would like to me to put together a sql based on this
description.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#10)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 2016-03-31 09:41:46 +0530, Pavan Deolasee wrote:

On Thu, Mar 31, 2016 at 6:27 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

Can you describe the process used to generate the sample WAL segment?

Shame that I can't find the sql file used to create the problematic WAL
segment. But this is what I did.

It's probably easier to just generate a humongous commit record. You can
do so by having a *lot* of subtransactions. Relatively easy to do with
plpgsql by creating them in a loop (SELECT txid_current() in EXCEPTION
bearing block ought to suffice).

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#11)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 31 March 2016 at 16:13, Andres Freund <andres@anarazel.de> wrote:

It's probably easier to just generate a humongous commit record. You can
do so by having a *lot* of subtransactions. Relatively easy to do with
plpgsql by creating them in a loop (SELECT txid_current() in EXCEPTION
bearing block ought to suffice).

This does the trick and does it quickly on 9.4:

CREATE TABLE trivial(padding text);
ALTER TABLE trivial ALTER COLUMN padding SET STORAGE plain;

DO
LANGUAGE plpgsql
$$
DECLARE
wal_seg_size integer;
remaining_size integer;
BEGIN
wal_seg_size := (select setting from pg_settings where name =
'wal_segment_size')::integer
* (select setting from pg_settings where name =
'wal_block_size')::integer;
LOOP
SELECT wal_seg_size - file_offset FROM
pg_xlogfile_name_offset(pg_current_xlog_insert_location()) INTO
remaining_size;
IF remaining_size < 8192
THEN
-- Make a nice big commit record right at the end of the segment
EXIT;
ELSE
BEGIN
-- About 4k
INSERT INTO trivial(padding) VALUES (repeat('0123456789abcdef',
256));
EXCEPTION
WHEN division_by_zero THEN
CONTINUE;
END;
END IF;
END LOOP;
END;
$$;

I had no issue reproducing the bug on 9.4, but I don't see it in 9.6. At
least, not this one, not yet.

Still, we might want to backpatch the fix; it seems clear enough even if I
can't reproduce the issue yet. It doesn't look like it'll affect logical
decoding since the snapshot builder has its own unrelated logic for finding
the startpoint for decoding, and it certainly doesn't affect WAL replay
otherwise we would've seen the fireworks much earlier. The upside is that
also makes the backpatch much safer.

I was surprised to see that there are no tests for pg_xlogdump to add this
on to, so I've added a trivial test. I've got some more complete
xlogdump-helper code in the failover slots tests that I should extract and
add to PostgresNode.pm but this'll do in the mean time and is much simpler.

In the process I noticed that pg_xlogdump doesn't follow the rules of the
rest of the codebase when it comes to command line handling with --version,
returning nonzero on bad options, etc. It might be good to fix that;
there's a small BC impact, but I doubt anyone cares when it comes to
pg_xlogdump.

I'll attach the new testcase once I either get it to reproduce this bug or
give up and leave the basic xlogdump testcase alone.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#12)
1 attachment(s)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On 1 April 2016 at 21:30, Craig Ringer <craig@2ndquadrant.com> wrote:

I'll attach the new testcase once I either get it to reproduce this bug or
give up and leave the basic xlogdump testcase alone.

I had another bash at this and I still can't reproduce it on master using
the giant commit record approach Andres suggested. In fact I generated a
commit record larger than an entire xlog segment and it was still fine.

The DO procedure I posted upthread, when run on 9.4, reliably produces
segments that the xlogreader cannot decode with the symptoms Pavan
reported. It's fine on 9.6.

So I can't reproduce this on 9.6, but there might be a separate bug on 9.4.

I've attached a patch with the simple tests I added for pg_xlogdump as part
of this. I doubt it'd be desirable to commit the ridiculous commit record
part, but that's trivially removed, and I left it in place in case someone
else wanted to fiddle with other ways to reproduce this.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Tests-for-pg_xlogdump.patchtext/x-patch; charset=US-ASCII; name=0001-Tests-for-pg_xlogdump.patchDownload
From bf05b552dd9eb5d2e9b50c77928426817033685d Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 6 Apr 2016 13:57:32 +0800
Subject: [PATCH] Tests for pg_xlogdump

---
 src/bin/pg_xlogdump/Makefile             |   6 +
 src/bin/pg_xlogdump/t/010_pg_xlogdump.pl | 249 +++++++++++++++++++++++++++++++
 2 files changed, 255 insertions(+)
 create mode 100644 src/bin/pg_xlogdump/t/010_pg_xlogdump.pl

diff --git a/src/bin/pg_xlogdump/Makefile b/src/bin/pg_xlogdump/Makefile
index 11df47d..c1138d2 100644
--- a/src/bin/pg_xlogdump/Makefile
+++ b/src/bin/pg_xlogdump/Makefile
@@ -38,3 +38,9 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_xlogdump$(X) $(OBJS) $(RMGRDESCSOURCES) xlogreader.c
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_xlogdump/t/010_pg_xlogdump.pl b/src/bin/pg_xlogdump/t/010_pg_xlogdump.pl
new file mode 100644
index 0000000..93b2063
--- /dev/null
+++ b/src/bin/pg_xlogdump/t/010_pg_xlogdump.pl
@@ -0,0 +1,249 @@
+use strict;
+use warnings;
+use bigint qw(hex);
+use Cwd;
+use Config;
+use File::Basename qw(basename);
+use List::Util qw(minstr maxstr);
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Data::Dumper;
+
+my $verbose = 1;
+
+sub slurp_xlogdump_lines
+{
+	my ($node, $timeline, $firstwal, $lastwal) = @_;
+	my ($stdout, $stderr) = ('', '');
+
+	if (defined($firstwal) && $firstwal =~ /^[[:xdigit:]]{24}$/)
+	{
+		$firstwal = $node->data_dir . "/pg_xlog/" . $firstwal;
+	}
+
+	if (defined($lastwal) && $lastwal !~ /^[[:xdigit:]]{24}$/)
+	{
+		die("pg_xlogdump expects the last WAL seg to be a bare filename, not '$lastwal'");
+	}
+
+	if (!defined($firstwal) || !defined($lastwal))
+	{
+		my $wal_pattern = sprintf("%s/pg_xlog/%08X" . ("?" x 16), $node->data_dir, $timeline);
+		my @wal = glob $wal_pattern;
+		$firstwal = List::Util::minstr(@wal) if !defined($firstwal);
+		$lastwal = basename(List::Util::maxstr(@wal)) if !defined($lastwal);
+	}
+
+	diag("decoding from " . $firstwal . " to " . $lastwal)
+		if $verbose;
+
+	IPC::Run::run ['pg_xlogdump', $firstwal, $lastwal], '>', \$stdout, '2>', \$stderr;
+
+	like($stderr, qr/(?:invalid record length at [0-9A-F]+\/[0-9A-F]+: wanted 24, got 0|^\s*$)/,
+		'pg_xlogdump exits with expected error or silence');
+
+	diag "xlogdump exited with: '$stderr'"
+		if $verbose;
+
+	my $lineno = 1;
+
+	my @xld_lines = split("\n", $stdout);
+
+	return \@xld_lines;
+}
+
+sub match_xlogdump_lines
+{
+	my ($node, $timeline, $firstwal, $lastwal) = @_;
+
+	my $xld_lines = slurp_xlogdump_lines($node, $timeline, $firstwal, $lastwal);
+
+	my @records;
+	my $lineno = 1;
+
+	for my $xld_line (@$xld_lines)
+	{
+		chomp $xld_line;
+
+		# We don't use Test::More's 'like' here because it'd run a variable number
+		# of tests. Instead do our own diagnostics and fail.
+		if ($xld_line =~ /^rmgr: (\w+)\s+len \(rec\/tot\):\s*([[:digit:]]+)\/\s*([[:digit:]]+), tx:\s*([[:digit:]]*), lsn: ([[:xdigit:]]{1,8})\/([[:xdigit:]]{1,8}), prev ([[:xdigit:]]{1,8})\/([[:xdigit:]]{1,8}), desc: (.*)$/)
+		{
+			my %record = (
+				'_raw' => length($xld_line) >= 100 ? substr($xld_line,0,100) . "..." : $xld_line,
+				'rmgr' => $1,
+				'len_rec' => int($2),
+				'len_tot' => int($3),
+				'tx' => defined($4) ? int($4) : undef,
+				'lsn' => (hex($5)<<32) + hex($6),
+				'prev' => (hex($7)<<32) + hex($8),
+				'lsn_str' => "$5/$6",
+				'prev_str' => "$7/$8",
+				'desc' => length($9) >= 100 ? substr($9,0,100) . "..." : $9,
+			);
+
+			if ($record{'prev'} >= $record{'lsn'})
+			{
+				diag("in xlog record on line $lineno:\n    $xld_line\n   ... prev ptr $record{prev_str} is greater than or equal to lsn ptr $record{lsn_str}");
+				cmp_ok($record{prev}, 'lt', $record{lsn}, 'previous pointer less than lsn ptr');
+			}
+
+			push @records, \%record;
+		}
+		else
+		{
+			diag("xlog record on line $lineno:\n    $xld_line\n   ...does not match the test pattern");
+			fail("an xlog record didn't match the expected pattern");
+			die();
+		}
+
+		$lineno ++;
+	}
+
+	return \@records;
+}
+
+sub xlog_seg_next
+{
+	my ($segno) = @_;
+
+	my ($timeline, $ptr) = ($segno =~ /^([[:xdigit:]]{8})([[:xdigit:]]{16})$/);
+
+	die("invalid xlog segment name $segno")
+	 unless (defined($timeline) && defined($ptr));
+
+	return sprintf('%08X%016X', hex($timeline), hex($ptr) + 1);
+
+}
+
+sub rmgrs_for_records
+{
+	my ($xld_records) = @_;
+
+	my %rmgrs;
+	for my $record (@$xld_records)
+	{
+		$rmgrs{$record->{rmgr}} ++;
+	}
+	diag "Record counts for each resource manager were: " . Dumper(\%rmgrs)
+		if $verbose;
+
+	return \%rmgrs;
+}
+
+program_help_ok('pg_xlogdump');
+SKIP:
+{
+	skip 'xlogdump version and cmdline curently nonstandard';
+
+	program_version_ok('pg_xlogdump');
+	program_options_handling_ok('pg_xlogdump');
+}
+
+my $tempdir = TestLib::tempdir;
+
+my $node = get_new_node('main');
+
+$node->init();
+
+my $pgdata = $node->data_dir;
+open my $CONF, ">>$pgdata/postgresql.conf";
+print $CONF "wal_keep_segments = 5000\n";
+close $CONF;
+$node->start;
+
+my $first_xlog = '000000010000000000000001';
+
+my $startup_end_xlog = $node->safe_psql('postgres', "SELECT pg_xlogfile_name(pg_switch_xlog());");
+
+$node->safe_psql('postgres', "CREATE TABLE dummy(id serial primary key, blah text);");
+$node->safe_psql('postgres', "INSERT INTO dummy(blah) SELECT repeat('dummytext', x) FROM generate_series(1, 10) x;");
+$node->safe_psql('postgres', "CREATE INDEX dummy_blah ON dummy(blah COLLATE \"C\");");
+$node->safe_psql('postgres', "CHECKPOINT");
+my $simpletest_end_xlog = $node->safe_psql('postgres', "SELECT pg_xlogfile_name(pg_switch_xlog());");
+
+my $start_xid = $node->safe_psql('postgres', "SELECT txid_current()");
+
+$node->safe_psql('postgres', q{
+CREATE TABLE trivial(padding text);
+
+CREATE UNLOGGED TABLE trivial_ul(padding text);
+
+DO
+LANGUAGE plpgsql
+$$
+DECLARE
+  wal_seg_size integer;
+  remaining_size integer;
+BEGIN
+  wal_seg_size := (select setting from pg_settings where name = 'wal_segment_size')::integer
+  				* (select setting from pg_settings where name = 'wal_block_size')::integer;
+
+  -- Write WAL until we're near the end of a segment, and in the process
+  -- generate a bunch of subxacts.
+  LOOP
+    SELECT wal_seg_size - file_offset FROM pg_xlogfile_name_offset(pg_current_xlog_insert_location()) INTO remaining_size;
+	RAISE NOTICE 'remaining: %',remaining_size;
+    IF remaining_size < 4096
+	THEN
+		EXIT;
+	ELSE
+		BEGIN
+		  INSERT INTO trivial(padding) VALUES (repeat('0123456789abcdef', 64));
+		EXCEPTION
+		  WHEN division_by_zero THEN
+		    RAISE EXCEPTION 'unreachable';
+		END;
+	END IF;
+  END LOOP;
+
+  -- OK, we're near the end of an xlog segment. Time to do some more writes,
+  -- but this time into an unlogged table so we don't generate xlog but we
+  -- do burn xids. We need to make a big enough record to spill over.
+  FOR i IN 1 .. 10000000
+  LOOP
+    BEGIN
+      INSERT INTO trivial_ul(padding) VALUES ('x');
+    EXCEPTION
+      WHEN division_by_zero THEN
+        RAISE EXCEPTION 'unreachable';
+    END;
+  END LOOP;
+
+END;
+$$;
+});
+
+my $end_xid = $node->safe_psql('postgres', "SELECT txid_current()");
+my $bigrecord_end_xlog = $node->safe_psql('postgres', "SELECT pg_xlogfile_name(pg_switch_xlog());");
+
+$node->stop('fast');
+
+my $xld_records = match_xlogdump_lines($node, 1, $first_xlog, $startup_end_xlog);
+
+diag "xlogdump emitted " . scalar(@$xld_records) . " entries up to end of startup";
+
+my $rmgrs = rmgrs_for_records($xld_records);
+
+# Must be lexically sorted:
+my @expected_rmgr_names = qw(
+	Btree
+	CLOG
+	Database
+	Heap
+	Heap2
+	Storage
+	Transaction
+	XLOG
+);
+my @rmgr_names = sort(keys(%$rmgrs));
+is_deeply(\@rmgr_names, \@expected_rmgr_names, 'Got expected set of rmgrs in startup');
+
+$xld_records = match_xlogdump_lines($node, 1, xlog_seg_next($startup_end_xlog), $simpletest_end_xlog);
+pass('decoded simple test segment');
+
+$xld_records = match_xlogdump_lines($node, 1, xlog_seg_next($simpletest_end_xlog), $bigrecord_end_xlog);
+pass('decoded segment with ridiculous commit record');
+
+done_testing();
-- 
2.1.0

#14Fujii Masao
masao.fujii@gmail.com
In reply to: Pavan Deolasee (#3)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Wed, Mar 23, 2016 at 7:04 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though
because
+        * ReadPageInternal() will read at least short page-header worth
of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

The patch looks good to me. Barring any objections,
I'll push this and back-patch to 9.3 where pg_xlogdump was added.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#14)
Re: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data

On Fri, Aug 26, 2016 at 10:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Mar 23, 2016 at 7:04 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:13 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+       /*
+        * Compute targetRecOff. It should typically be greater than short
+        * page-header since a valid record can't , but can also be zero
when
+        * caller has supplied a page-aligned address or when we are
skipping
+        * multi-page continuation record. It doesn't matter though
because
+        * ReadPageInternal() will read at least short page-header worth
of
+        * data
+        */
This needs some polishing, there is an unfinished sentence here.

+ targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
targetRecOff, pageHeaderSize and targetPagePtr could be declared
inside directly the new while loop.

Thanks Michael for reviewing the patch. I've fixed these issues and new
version is attached.

The patch looks good to me. Barring any objections,
I'll push this and back-patch to 9.3 where pg_xlogdump was added.

Pushed. Thanks!

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers