Possible gaps/garbage in the output of XLOG reader

Started by Antonin Houskaalmost 11 years ago3 messages
#1Antonin Houska
ah@cybertec.at
1 attachment(s)

While playing with xlogreader, I was lucky enough to see one of the many
record validations to fail. After having some fun with gdb, I found out that
in some cases the reader does not enforce enough data to be in state->readBuf
before copying into state->readRecordBuf starts. This should not happen if the
callback always reads XLOG_BLCKSZ bytes, but in fact only *reqLen* is the
mandatory size of the chunk delivered.

There are probably various ways to fix this problem. Attached is what I did in
my environment. I hit the problem on 9.4.1, but the patch seems to apply to
master too.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Attachments:

xlogreader_9.4.1.difftext/x-diffDownload
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 474137a..e6ebd9d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -313,7 +313,21 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		goto err;
 	}
 
-	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+	/* Bytes of the current record residing on the current page. */
+	len = Min(XLOG_BLCKSZ - targetRecOff, total_len);
+
+	/*
+	 * Nothing beyond the record header is guaranteed to be in state->readBuf
+	 * so far.
+	 */
+	if (readOff < targetRecOff + len)
+	{
+		readOff = ReadPageInternal(state, targetPagePtr, targetRecOff + len);
+
+		if (readOff < 0)
+			goto err;
+	}
+
 	if (total_len > len)
 	{
 		/* Need to reassemble record */
@@ -322,9 +336,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		char	   *buffer;
 		uint32		gotlen;
 
+		Assert(readOff == targetRecOff + len);
+		Assert(readOff == XLOG_BLCKSZ);
+
 		/* Copy the first fragment of the record from the first page. */
-		memcpy(state->readRecordBuf,
-			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+		memcpy(state->readRecordBuf, state->readBuf + targetRecOff, len);
 		buffer = state->readRecordBuf + len;
 		gotlen = len;
 
@@ -413,20 +429,16 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	}
 	else
 	{
-		/* Wait for the record data to become available */
-		readOff = ReadPageInternal(state, targetPagePtr,
-								 Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
-			goto err;
+		Assert(readOff >= targetRecOff + len);
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->EndRecPtr = RecPtr + MAXALIGN(len);
 
 		state->ReadRecPtr = RecPtr;
-		memcpy(state->readRecordBuf, record, total_len);
+		memcpy(state->readRecordBuf, record, len);
 	}
 
 	/*
#2Michael Paquier
michael.paquier@gmail.com
In reply to: Antonin Houska (#1)
Re: Possible gaps/garbage in the output of XLOG reader

On Thu, Apr 9, 2015 at 7:05 PM, Antonin Houska <ah@cybertec.at> wrote:

While playing with xlogreader, I was lucky enough to see one of the many
record validations to fail. After having some fun with gdb, I found out that
in some cases the reader does not enforce enough data to be in state->readBuf
before copying into state->readRecordBuf starts. This should not happen if the
callback always reads XLOG_BLCKSZ bytes, but in fact only *reqLen* is the
mandatory size of the chunk delivered.

There are probably various ways to fix this problem. Attached is what I did in
my environment. I hit the problem on 9.4.1, but the patch seems to apply to
master too.

This looks similar to what is discussed here:
/messages/by-id/CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
And there is a different patch which takes a lower-level approach, and
it seems to me cleaner approach in its way of calculating the record
offset when it goes through several XLOG pages.

Perhaps you could help review it? It is attached to the next commit fest.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Antonin Houska
ah@cybertec.at
In reply to: Michael Paquier (#2)
2 attachment(s)
Re: [HACKERS] Possible gaps/garbage in the output of XLOG reader

Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Apr 9, 2015 at 7:05 PM, Antonin Houska <ah@cybertec.at> wrote:

While playing with xlogreader, I was lucky enough to see one of the many
record validations to fail. After having some fun with gdb, I found out that
in some cases the reader does not enforce enough data to be in state->readBuf
before copying into state->readRecordBuf starts. This should not happen if the
callback always reads XLOG_BLCKSZ bytes, but in fact only *reqLen* is the
mandatory size of the chunk delivered.

There are probably various ways to fix this problem. Attached is what I did in
my environment. I hit the problem on 9.4.1, but the patch seems to apply to
master too.

This looks similar to what is discussed here:
/messages/by-id/CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
And there is a different patch which takes a lower-level approach, and
it seems to me cleaner approach in its way of calculating the record
offset when it goes through several XLOG pages.

Perhaps you could help review it? It is attached to the next commit fest.

I don't see an overlap. The problem I reported is still there. Attached is a
patch for the current HEAD.

Attached is also a patch that I use to simulate the problem. It includes this
hunk

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index e42b828..e7b56b6
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** retry:
*** 11648,11654 ****
        }

pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
! if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];

--- 11648,11654 ----
        }

pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
! if (read(readFile, readBuf, readLen) != readLen)
{
char fname[MAXFNAMELEN];

that I posted in [1]/messages/by-id/16062.1516189589@localhost. When the reproduce.diff was applied (but not
xlogreader_readbuf_gap.diff), I could fire the contained Assert(readOff ==
XLOG_BLCKSZ) statement this way:

1. Create a new cluster.

2. Create another one and setup streaming replication.

3. Run regression test on master.

After some time the Assert() statement fired on the standby.

There are probably simpler scenarios but here I want to stress that the
problem is related to [1]/messages/by-id/16062.1516189589@localhost.

[1]: /messages/by-id/16062.1516189589@localhost

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

Attachments:

xlogreader_readbuf_gap.difftext/x-diffDownload
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 3a86f34..764a579
*** a/src/backend/access/transam/xlogreader.c
--- b/src/backend/access/transam/xlogreader.c
*************** XLogReadRecord(XLogReaderState *state, X
*** 345,351 ****
  		goto err;
  	}
  
! 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
  	if (total_len > len)
  	{
  		/* Need to reassemble record */
--- 345,365 ----
  		goto err;
  	}
  
! 	/* Bytes of the current record residing on the current page. */
! 	len = Min(XLOG_BLCKSZ - targetRecOff, total_len);
! 
! 	/*
! 	 * Nothing beyond the record header is guaranteed to be in state->readBuf
! 	 * so far.
! 	 */
! 	if (readOff < targetRecOff + len)
! 	{
! 		readOff = ReadPageInternal(state, targetPagePtr, targetRecOff + len);
! 
! 		if (readOff < 0)
! 			goto err;
! 	}
! 
  	if (total_len > len)
  	{
  		/* Need to reassemble record */
*************** XLogReadRecord(XLogReaderState *state, X
*** 354,362 ****
  		char	   *buffer;
  		uint32		gotlen;
  
  		/* Copy the first fragment of the record from the first page. */
! 		memcpy(state->readRecordBuf,
! 			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
  		buffer = state->readRecordBuf + len;
  		gotlen = len;
  
--- 368,378 ----
  		char	   *buffer;
  		uint32		gotlen;
  
+ 		Assert(readOff == targetRecOff + len);
+ 		Assert(readOff == XLOG_BLCKSZ);
+ 
  		/* Copy the first fragment of the record from the first page. */
! 		memcpy(state->readRecordBuf, state->readBuf + targetRecOff, len);
  		buffer = state->readRecordBuf + len;
  		gotlen = len;
  
*************** XLogReadRecord(XLogReaderState *state, X
*** 445,464 ****
  	}
  	else
  	{
! 		/* Wait for the record data to become available */
! 		readOff = ReadPageInternal(state, targetPagePtr,
! 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
! 		if (readOff < 0)
! 			goto err;
  
  		/* Record does not cross a page boundary */
  		if (!ValidXLogRecord(state, record, RecPtr))
  			goto err;
  
! 		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
  
  		state->ReadRecPtr = RecPtr;
! 		memcpy(state->readRecordBuf, record, total_len);
  	}
  
  	/*
--- 461,480 ----
  	}
  	else
  	{
! 		/*
! 		 * The whole record is on the current page and should have been
! 		 * fetched into readBuf.
! 		 */
! 		Assert(readOff >= targetRecOff + len);
  
  		/* Record does not cross a page boundary */
  		if (!ValidXLogRecord(state, record, RecPtr))
  			goto err;
  
! 		state->EndRecPtr = RecPtr + MAXALIGN(len);
  
  		state->ReadRecPtr = RecPtr;
! 		memcpy(state->readRecordBuf, record, len);
  	}
  
  	/*
reproduce.difftext/x-diffDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index e42b828..e7b56b6
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** retry:
*** 11648,11654 ****
  	}
  
  	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
! 	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
  	{
  		char		fname[MAXFNAMELEN];
  
--- 11648,11654 ----
  	}
  
  	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
! 	if (read(readFile, readBuf, readLen) != readLen)
  	{
  		char		fname[MAXFNAMELEN];
  
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 3a86f34..12d38f0
*** a/src/backend/access/transam/xlogreader.c
--- b/src/backend/access/transam/xlogreader.c
*************** XLogReadRecord(XLogReaderState *state, X
*** 354,359 ****
--- 354,365 ----
  		char	   *buffer;
  		uint32		gotlen;
  
+ 		/*
+ 		 * Since we're going to copy the rest of the current page, the whole
+ 		 * page should be in readBuf.
+ 		 */
+ 		Assert(readOff == XLOG_BLCKSZ);
+ 
  		/* Copy the first fragment of the record from the first page. */
  		memcpy(state->readRecordBuf,
  			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 130ecd5..9f2d40b
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 103,109 ****
   * because signals are checked only between messages.  128kB (with
   * default 8k blocks) seems like a reasonable guess for now.
   */
! #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
  
  /* Array of WalSnds in shared memory */
  WalSndCtlData *WalSndCtl = NULL;
--- 103,109 ----
   * because signals are checked only between messages.  128kB (with
   * default 8k blocks) seems like a reasonable guess for now.
   */
! #define MAX_SEND_SIZE (XLOG_BLCKSZ / 2)
  
  /* Array of WalSnds in shared memory */
  WalSndCtlData *WalSndCtl = NULL;
*************** XLogSendPhysical(void)
*** 2688,2694 ****
  	else
  	{
  		/* round down to page boundary. */
! 		endptr -= (endptr % XLOG_BLCKSZ);
  		WalSndCaughtUp = false;
  	}
  
--- 2688,2694 ----
  	else
  	{
  		/* round down to page boundary. */
! 		//endptr -= (endptr % XLOG_BLCKSZ);
  		WalSndCaughtUp = false;
  	}