pgsql: Add contrib/pg_walinspect.

[1]: /messages/by-id/111657.1650910309@sss.pgh.pa.us

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Michael Paquier (#2)

Re: pgsql: Add contrib/pg_walinspect.

Michael Paquier <michael@paquier.xyz> writes:

Could this be an issue related to the addition of the circular WAL
decoding buffer, aka 3f1ce97?

I already whined about this at [1]/messages/by-id/111657.1650910309@sss.pgh.pa.us.

I've been wondering if the issue could be traced to topminnow's unusual
hardware properties, specifically that it has MAXALIGN 8 even though
it's only a 32-bit machine per sizeof(void *). I think the only
other active buildfarm animal like that is my gaur ... but I've
failed to reproduce it on gaur. Best guess at the moment is that
it's a timing issue that topminnow manages to reproduce often.

Anyway, as I said in the other thread, I can reproduce it on
topminnow's host. Let me know if you have ideas about how to
home in on the cause.

regards, tom lane

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Tom Lane (#3)

Re: pgsql: Add contrib/pg_walinspect.

On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:

I've been wondering if the issue could be traced to topminnow's unusual
hardware properties, specifically that it has MAXALIGN 8 even though
it's only a 32-bit machine per sizeof(void *). I think the only
other active buildfarm animal like that is my gaur ... but I've
failed to reproduce it on gaur. Best guess at the moment is that
it's a timing issue that topminnow manages to reproduce often.

I have managed to miss your message. Let's continue the discussion
there, then.

Anyway, as I said in the other thread, I can reproduce it on
topminnow's host. Let me know if you have ideas about how to
home in on the cause.

Nice. I have not been able to do so, but based on the lack of
reports from the buildfarm, that's not surprising.
--
Michael

thomas.munro@gmail.com

over 3 years ago

In reply to: Michael Paquier (#4)

1 attachment(s)

Re: pgsql: Add contrib/pg_walinspect.

On Tue, Apr 26, 2022 at 5:36 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:

I've been wondering if the issue could be traced to topminnow's unusual
hardware properties, specifically that it has MAXALIGN 8 even though
it's only a 32-bit machine per sizeof(void *). I think the only
other active buildfarm animal like that is my gaur ... but I've
failed to reproduce it on gaur. Best guess at the moment is that
it's a timing issue that topminnow manages to reproduce often.

I have managed to miss your message. Let's continue the discussion
there, then.

I think it's a bug in pg_walinspect, so I'll move the discussion back
here. Here's one rather simple way to fix it, that has survived
running the test a thousand times (using a recipe that failed for me
quite soon, after 20-100 attempts or so; I never figured out how to
get the 50% failure rate reported by Tom). Explanation in commit
message. You can see that the comments near the first hunk already
contemplated this possibility, but just didn't try to handle it.

Another idea that I slept on, but rejected, is that the new WOULDBLOCK
return value introduced to support WAL prefetching could be used here
(it's a way of reporting a lack of data, different from errors).
Unfortunately it's not exposed to the XLogReadRecord() interface, as I
only intended it for use by XLogReadAhead(). I don't really think
it's a good idea to redesign that API at this juncture.

Maybe there is some other way I haven't considered -- is there a way
to get the LSN past the latest whole flushed record from shmem?

Attachments:

0001-Fix-pg_walinspect-race-against-flush-LSN.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-pg_walinspect-race-against-flush-LSN.patchDownload

From d78ae8fcac7195d2cc967bcde754d1ce1fa39321 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 27 Apr 2022 11:39:10 +1200
Subject: [PATCH] Fix pg_walinspect race against flush LSN.

Instability in the TAP test for pg_walinspect revealed that
pg_get_wal_records_info_till_end_of_wal(x) would try to decode all the
records beginning in the range [x, flush LSN), even though that might
include a partial record at the end of the range.  In that case, an
error would be reported by read_local_xlog_page_no_wait when it tried to
read past the flush LSN.  That caused a test failure only on a BF
animal that had been restarted recently, but could be expected to happen
in the wild quite easily depending on the alignment of various
parameters.

Adjust pg_walinspect the tolerate errors without messages, as a way to
discover the end of the WAL region to decode.

Discussion: https://postgr.es/m/Ymd/e5eeZMNAkrXo%40paquier.xyz
Discussion: https://postgr.es/m/111657.1650910309@sss.pgh.pa.us
---
 contrib/pg_walinspect/pg_walinspect.c | 51 +++++++++++----------------
 1 file changed, 21 insertions(+), 30 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index bf38863ff1..10507c0c0c 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -133,6 +133,7 @@ InitXLogReaderState(XLogRecPtr lsn, XLogRecPtr *first_record)
  * We guard against ordinary errors trying to read WAL that hasn't been
  * written yet by limiting end_lsn to the flushed WAL, but that can also
  * encounter errors if the flush pointer falls in the middle of a record.
+ * In that case we'll return NULL.
  */
 static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
@@ -149,11 +150,11 @@ ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
 							LSN_FORMAT_ARGS(first_record), errormsg)));
-		else
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read WAL at %X/%X",
-							LSN_FORMAT_ARGS(first_record))));
+
+		/*
+		 * With no error message, assume that read_local_xlog_page_no_wait
+		 * tried to read past the flush LSN and reported an error.
+		 */
 	}
 
 	return record;
@@ -246,7 +247,12 @@ pg_get_wal_record_info(PG_FUNCTION_ARGS)
 
 	xlogreader = InitXLogReaderState(lsn, &first_record);
 
-	(void) ReadNextXLogRecord(xlogreader, first_record);
+	if (!ReadNextXLogRecord(xlogreader, first_record))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("could not read WAL at %X/%X",
+						LSN_FORMAT_ARGS(first_record))));
+
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
@@ -327,22 +333,14 @@ GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
 
-	for (;;)
+	while (ReadNextXLogRecord(xlogreader, first_record) &&
+		   xlogreader->EndRecPtr <= end_lsn)
 	{
-		(void) ReadNextXLogRecord(xlogreader, first_record);
+		GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
+						 PG_GET_WAL_RECORDS_INFO_COLS);
 
-		if (xlogreader->EndRecPtr <= end_lsn)
-		{
-			GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
-							 PG_GET_WAL_RECORDS_INFO_COLS);
-
-			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
-								 values, nulls);
-		}
-
-		/* if we read up to end_lsn, we're done */
-		if (xlogreader->EndRecPtr >= end_lsn)
-			break;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
 
 		CHECK_FOR_INTERRUPTS();
 	}
@@ -555,17 +553,10 @@ GetWalStats(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
 
 	MemSet(&stats, 0, sizeof(stats));
 
-	for (;;)
+	while (ReadNextXLogRecord(xlogreader, first_record) &&
+		   xlogreader->EndRecPtr <= end_lsn)
 	{
-		(void) ReadNextXLogRecord(xlogreader, first_record);
-
-		if (xlogreader->EndRecPtr <= end_lsn)
-			XLogRecStoreStats(&stats, xlogreader);
-
-		/* if we read up to end_lsn, we're done */
-		if (xlogreader->EndRecPtr >= end_lsn)
-			break;
-
+		XLogRecStoreStats(&stats, xlogreader);
 		CHECK_FOR_INTERRUPTS();
 	}
 
-- 
2.35.1

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#5)

Re: pgsql: Add contrib/pg_walinspect.

Thomas Munro <thomas.munro@gmail.com> writes:

I think it's a bug in pg_walinspect, so I'll move the discussion back
here. Here's one rather simple way to fix it, that has survived
running the test a thousand times (using a recipe that failed for me
quite soon, after 20-100 attempts or so; I never figured out how to
get the 50% failure rate reported by Tom).

Not sure what we're doing differently, but plain "make check" in
contrib/pg_walinspect fails pretty consistently for me on gcc23.
I tried it again just now and got five failures in five attempts.

I then installed your patch and got the same failure, three times
out of three, so I don't think we're there yet.

Again, since I do have this problem in captivity, I'm happy
to spend some time poking at it. But I could use a little
guidance where to poke, because I've not looked at any of
the WAL prefetch stuff.

regards, tom lane

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#6)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 12:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

I think it's a bug in pg_walinspect, so I'll move the discussion back
here. Here's one rather simple way to fix it, that has survived
running the test a thousand times (using a recipe that failed for me
quite soon, after 20-100 attempts or so; I never figured out how to
get the 50% failure rate reported by Tom).

Not sure what we're doing differently, but plain "make check" in
contrib/pg_walinspect fails pretty consistently for me on gcc23.
I tried it again just now and got five failures in five attempts.

I tried on the /home filesystem (a slow NFS mount) and then inside a
directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
local filesystem, based on the present of "ext4" in the pathname and I
was trying everything I could think of). I used what I thought might
be some relevant starter configure options copied from the animal:

./configure --prefix=$HOME/install --enable-cassert --enable-debug
--enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
-funwind-tables" LDFLAGS="-rdynamic"

For me, make check always succeeds in contrib/pg_walinspect. For me,
make installcheck fails if I do it enough times in a loop, somewhere
around the 20th loop or so, which I imagine has to do with WAL page
boundaries moving around.

for i in `seq 1 1000` ; do
make -s installcheck || exit 1
done

I then installed your patch and got the same failure, three times
out of three, so I don't think we're there yet.

Hrmph... Are you sure you rebuilt the contrib module? Assuming so,
maybe it's failing in a different way for you and me. For me, it
always fails after this break is reached in xlogutil.c:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
break;

If you add a log message there, do you see that? For me, the patch
fixes it, because it teaches pg_walinspect that messageless errors are
a way of detecting end-of-data (due to the code above, introduced by
the pg_walinspect commit).

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#7)

2 attachment(s)

Re: pgsql: Add contrib/pg_walinspect.

Thomas Munro <thomas.munro@gmail.com> writes:

Hrmph... Are you sure you rebuilt the contrib module? Assuming so,
maybe it's failing in a different way for you and me. For me, it
always fails after this break is reached in xlogutil.c:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
break;

Hmm. For me, that statement is not reached at all in successful
(make installcheck) runs. In a failing run, it's reached with
wait_for_wal = false, after which we get the "could not read WAL"
failure. Usually that happens twice, as per attached.

regards, tom lane

Attachments:

debug.patchtext/x-diff; charset=us-ascii; name=debug.patchDownload

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
 
 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 		 */
 		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));
 
+		if(!*errormsg)
+		  *errormsg = "decode_queue_head is null";
+		
 		return NULL;
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..4089efd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -955,6 +955,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			elog(LOG, "let's wait for wal: %d", wait_for_wal);
+
 			/* If asked, let's not wait for future WAL. */
 			if (!wait_for_wal)
 				break;

fail.logtext/plain; charset=us-ascii; name=fail.logDownload

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#8)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Hrmph... Are you sure you rebuilt the contrib module? Assuming so,
maybe it's failing in a different way for you and me. For me, it
always fails after this break is reached in xlogutil.c:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
break;

Hmm. For me, that statement is not reached at all in successful
(make installcheck) runs. In a failing run, it's reached with
wait_for_wal = false, after which we get the "could not read WAL"
failure. Usually that happens twice, as per attached.

Ok, that's the same for me. Next question: why does the patch I
posted not help? For me, the error "could not read WAL at %X/%X",
seen on the BF log, is raised by ReadNextXLogRecord() in
pg_walinspect.c. The patch removes that ereport() entirely (and
handles NULL in a couple of places).

#10

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#7)

Re: pgsql: Add contrib/pg_walinspect.

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Apr 27, 2022 at 12:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not sure what we're doing differently, but plain "make check" in
contrib/pg_walinspect fails pretty consistently for me on gcc23.
I tried it again just now and got five failures in five attempts.

I tried on the /home filesystem (a slow NFS mount) and then inside a
directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
local filesystem, based on the present of "ext4" in the pathname and I
was trying everything I could think of). I used what I thought might
be some relevant starter configure options copied from the animal:

./configure --prefix=$HOME/install --enable-cassert --enable-debug
--enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
-funwind-tables" LDFLAGS="-rdynamic"

Hmph. I'm also running it on the /home filesystem, and I used
these settings:

export CC="ccache gcc -mips32r2"
export CFLAGS="-O2 -funwind-tables"
export LDFLAGS="-rdynamic"

./configure --enable-debug --enable-cassert --with-systemd --enable-nls --with-icu --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

plus uninteresting stuff like --prefix. Now maybe some of these
other options affect this, but I'd be pretty surprised if so.
So I'm at a loss why it behaves differently for you.

regards, tom lane

#11

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#9)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 1:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 27, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Hrmph... Are you sure you rebuilt the contrib module? Assuming so,
maybe it's failing in a different way for you and me. For me, it
always fails after this break is reached in xlogutil.c:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
break;

Hmm. For me, that statement is not reached at all in successful
(make installcheck) runs. In a failing run, it's reached with
wait_for_wal = false, after which we get the "could not read WAL"
failure. Usually that happens twice, as per attached.

Ok, that's the same for me. Next question: why does the patch I
posted not help? For me, the error "could not read WAL at %X/%X",
seen on the BF log, is raised by ReadNextXLogRecord() in
pg_walinspect.c. The patch removes that ereport() entirely (and
handles NULL in a couple of places).

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch. I mean this:

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";

#12

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#9)

3 attachment(s)

Re: pgsql: Add contrib/pg_walinspect.

Thomas Munro <thomas.munro@gmail.com> writes:

Ok, that's the same for me. Next question: why does the patch I
posted not help?

I improved the instrumentation a bit, and it looks like what is
happening is that loc > read_upto, causing that code to "break"
independently of wait_for_wal. success.log is from "make installcheck"
immediately after initdb; fail.log is from "make check".

regards, tom lane

Attachments:

debug.patchtext/x-diff; charset=us-ascii; name=debug.patchDownload

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
 
 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 		 */
 		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));
 
+		if(!*errormsg)
+		  *errormsg = "decode_queue_head is null";
+		
 		return NULL;
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..5ea59a3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -951,6 +951,9 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
+		  elog(LOG, "considering wait for wal: loc %X/%X read_upto %X/%X wait_for_wal %d",
+		       LSN_FORMAT_ARGS(loc), LSN_FORMAT_ARGS(read_upto),
+		       wait_for_wal);
 
 			if (loc <= read_upto)
 				break;

success.logtext/plain; charset=us-ascii; name=success.logDownload

fail.logtext/plain; charset=us-ascii; name=fail.logDownload

#13

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#11)

Re: pgsql: Add contrib/pg_walinspect.

Thomas Munro <thomas.munro@gmail.com> writes:

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch. I mean this:

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";

Oh! Okay, I'll retry without that.

regards, tom lane

#14

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#13)

Re: pgsql: Add contrib/pg_walinspect.

I wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch. I mean this:

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";

Oh! Okay, I'll retry without that.

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is. Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.

regards, tom lane

#15

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Tom Lane (#14)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 8:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch. I mean this:
+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";
Oh! Okay, I'll retry without that.

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is. Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.

Sorry for being late in the game, occupied with other stuff.

How about using private_data of XLogReaderState for
read_local_xlog_page_no_wait, something like this?

typedef struct ReadLocalXLogPageNoWaitPrivate
{
bool end_of_wal;
} ReadLocalXLogPageNoWaitPrivate;

In read_local_xlog_page_no_wait:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
{
private_data->end_of_wal = true;
break;
}

/*
* Opaque data for callbacks to use. Not used by XLogReader.
*/
void *private_data;

Regards,
Bharath Rupireddy.

#16

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#15)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 8:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Apr 27, 2022 at 8:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch. I mean this:
+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";
Oh! Okay, I'll retry without that.

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is. Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.
Sorry for being late in the game, occupied with other stuff.

How about using private_data of XLogReaderState for
read_local_xlog_page_no_wait, something like this?

typedef struct ReadLocalXLogPageNoWaitPrivate
{
bool end_of_wal;
} ReadLocalXLogPageNoWaitPrivate;

In read_local_xlog_page_no_wait:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
{
private_data->end_of_wal = true;
break;
}

/*
* Opaque data for callbacks to use. Not used by XLogReader.
*/
void *private_data;

I found an easy way to reproduce this consistently (I think on any server):

I basically generated huge WAL record (I used a fun extension that I
wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
use pg_logical_emit_message as well)
session 1:
select * from pg_synthesize_wal_record(1*1024*1024); --> generate 1 MB
of WAL record first and make a note of the output lsn (lsn1)

session 2:
select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
\watch 1

session 1:
select * from pg_synthesize_wal_record(1000*1024*1024); --> generate
~1 GB of WAL record and we see ERROR: could not read WAL at XXXXX in
session 2.

Delay the checkpoint (set checkpoint_timeout to 1hr) just not recycle
the wal files while we run pg_walinspect functions, no other changes
required from the default initdb settings on the server.

And, Thomas's patch fixes the issue.

Regards,
Bharath Rupireddy.

#17

[1]: /messages/by-id/CA+hUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA@mail.gmail.com

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#16)

1 attachment(s)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 1:47 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is. Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.

Sorry for being late in the game, occupied with other stuff.

How about using private_data of XLogReaderState for
read_local_xlog_page_no_wait, something like this?

typedef struct ReadLocalXLogPageNoWaitPrivate
{
bool end_of_wal;
} ReadLocalXLogPageNoWaitPrivate;

In read_local_xlog_page_no_wait:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
{
private_data->end_of_wal = true;
break;
}

/*
* Opaque data for callbacks to use. Not used by XLogReader.
*/
void *private_data;

I found an easy way to reproduce this consistently (I think on any server):

I basically generated huge WAL record (I used a fun extension that I
wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
use pg_logical_emit_message as well)
session 1:
select * from pg_synthesize_wal_record(1*1024*1024); --> generate 1 MB
of WAL record first and make a note of the output lsn (lsn1)

session 2:
select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
\watch 1

session 1:
select * from pg_synthesize_wal_record(1000*1024*1024); --> generate
~1 GB of WAL record and we see ERROR: could not read WAL at XXXXX in
session 2.

Delay the checkpoint (set checkpoint_timeout to 1hr) just not recycle
the wal files while we run pg_walinspect functions, no other changes
required from the default initdb settings on the server.

And, Thomas's patch fixes the issue.

Here's v2 patch (up on Thomas's v1 at [1]/messages/by-id/CA+hUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA@mail.gmail.com) using private_data to set
the end of the WAL flag. Please have a look at it.

Regards,
Bharath Rupireddy.

Attachments:

v2-0001-Fix-pg_walinspect-race-against-flush-LSN.patchapplication/octet-stream; name=v2-0001-Fix-pg_walinspect-race-against-flush-LSN.patchDownload

From 01d0c6e440d344a06475bccbb6d31373957c7ad7 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 27 Apr 2022 10:17:11 +0000
Subject: [PATCH v2] Fix pg_walinspect race against flush LSN.

Instability in the TAP test for pg_walinspect revealed that
pg_get_wal_records_info_till_end_of_wal(x) would try to decode all the
records beginning in the range [x, flush LSN), even though that might
include a partial record at the end of the range.  In that case, an
error would be reported by read_local_xlog_page_no_wait when it tried to
read past the flush LSN.  That caused a test failure only on a BF
animal that had been restarted recently, but could be expected to happen
in the wild quite easily depending on the alignment of various
parameters.

Adjust pg_walinspect the tolerate errors without messages, as a way to
discover the end of the WAL region to decode.

Discussion: https://postgr.es/m/Ymd/e5eeZMNAkrXo%40paquier.xyz
Discussion: https://postgr.es/m/111657.1650910309@sss.pgh.pa.us

Authors: Thomas Munro, Bharath Rupireddy.
---
 contrib/pg_walinspect/pg_walinspect.c  | 64 +++++++++++++-------------
 src/backend/access/transam/xlogutils.c | 12 +++++
 src/include/access/xlogutils.h         |  6 +++
 3 files changed, 51 insertions(+), 31 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index bf38863ff1..88b15eb753 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -89,6 +89,7 @@ static XLogReaderState *
 InitXLogReaderState(XLogRecPtr lsn, XLogRecPtr *first_record)
 {
 	XLogReaderState *xlogreader;
+	ReadLocalXLogPageNoWaitPrivate *private_data;
 
 	/*
 	 * Reading WAL below the first page of the first segments isn't allowed.
@@ -100,11 +101,14 @@ InitXLogReaderState(XLogRecPtr lsn, XLogRecPtr *first_record)
 				(errmsg("could not read WAL at LSN %X/%X",
 						LSN_FORMAT_ARGS(lsn))));
 
+	private_data = (ReadLocalXLogPageNoWaitPrivate *)
+						palloc0(sizeof(ReadLocalXLogPageNoWaitPrivate));
+
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page_no_wait,
 											   .segment_open = &wal_segment_open,
 											   .segment_close = &wal_segment_close),
-									NULL);
+									private_data);
 
 	if (xlogreader == NULL)
 		ereport(ERROR,
@@ -132,7 +136,8 @@ InitXLogReaderState(XLogRecPtr lsn, XLogRecPtr *first_record)
  *
  * We guard against ordinary errors trying to read WAL that hasn't been
  * written yet by limiting end_lsn to the flushed WAL, but that can also
- * encounter errors if the flush pointer falls in the middle of a record.
+ * encounter errors if the flush pointer falls in the middle of a record. In
+ * that case we'll return NULL.
  */
 static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
@@ -144,16 +149,20 @@ ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
 
 	if (record == NULL)
 	{
+		ReadLocalXLogPageNoWaitPrivate *private_data;
+
+		/* return NULL, if end of WAL is reached */
+		private_data = (ReadLocalXLogPageNoWaitPrivate *)
+							xlogreader->private_data;
+
+		if (private_data->end_of_wal)
+			return NULL;
+
 		if (errormsg)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
 							LSN_FORMAT_ARGS(first_record), errormsg)));
-		else
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read WAL at %X/%X",
-							LSN_FORMAT_ARGS(first_record))));
 	}
 
 	return record;
@@ -246,7 +255,11 @@ pg_get_wal_record_info(PG_FUNCTION_ARGS)
 
 	xlogreader = InitXLogReaderState(lsn, &first_record);
 
-	(void) ReadNextXLogRecord(xlogreader, first_record);
+	if (!ReadNextXLogRecord(xlogreader, first_record))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("could not read WAL at %X/%X",
+						LSN_FORMAT_ARGS(first_record))));
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
@@ -254,6 +267,7 @@ pg_get_wal_record_info(PG_FUNCTION_ARGS)
 	GetWALRecordInfo(xlogreader, first_record, values, nulls,
 					 PG_GET_WAL_RECORD_INFO_COLS);
 
+	pfree(xlogreader->private_data);
 	XLogReaderFree(xlogreader);
 
 	tuple = heap_form_tuple(tupdesc, values, nulls);
@@ -327,26 +341,19 @@ GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
 
-	for (;;)
+	while (ReadNextXLogRecord(xlogreader, first_record) &&
+		   xlogreader->EndRecPtr <= end_lsn)
 	{
-		(void) ReadNextXLogRecord(xlogreader, first_record);
+		GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
+						 PG_GET_WAL_RECORDS_INFO_COLS);
 
-		if (xlogreader->EndRecPtr <= end_lsn)
-		{
-			GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
-							 PG_GET_WAL_RECORDS_INFO_COLS);
-
-			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
-								 values, nulls);
-		}
-
-		/* if we read up to end_lsn, we're done */
-		if (xlogreader->EndRecPtr >= end_lsn)
-			break;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
 
 		CHECK_FOR_INTERRUPTS();
 	}
 
+	pfree(xlogreader->private_data);
 	XLogReaderFree(xlogreader);
 
 #undef PG_GET_WAL_RECORDS_INFO_COLS
@@ -555,20 +562,15 @@ GetWalStats(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
 
 	MemSet(&stats, 0, sizeof(stats));
 
-	for (;;)
+	while (ReadNextXLogRecord(xlogreader, first_record) &&
+		   xlogreader->EndRecPtr <= end_lsn)
 	{
-		(void) ReadNextXLogRecord(xlogreader, first_record);
-
-		if (xlogreader->EndRecPtr <= end_lsn)
-			XLogRecStoreStats(&stats, xlogreader);
-
-		/* if we read up to end_lsn, we're done */
-		if (xlogreader->EndRecPtr >= end_lsn)
-			break;
+		XLogRecStoreStats(&stats, xlogreader);
 
 		CHECK_FOR_INTERRUPTS();
 	}
 
+	pfree(xlogreader->private_data);
 	XLogReaderFree(xlogreader);
 
 	MemSet(values, 0, sizeof(values));
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 425702641a..1c52b128fc 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -957,7 +957,19 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 			/* If asked, let's not wait for future WAL. */
 			if (!wait_for_wal)
+			{
+				ReadLocalXLogPageNoWaitPrivate *private_data;
+
+				/*
+				 * Inform the caller of read_local_xlog_page_no_wait that the
+				 * end of WAL has reached so that it can deal wiht it
+				 * accordingly.
+				 */
+				private_data = (ReadLocalXLogPageNoWaitPrivate *)
+										state->private_data;
+				private_data->end_of_wal = true;
 				break;
+			}
 
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 761625acf4..5fcbbc136f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -75,6 +75,12 @@ typedef enum
 								 * need to be replayed) */
 } XLogRedoAction;
 
+/* Private data of the read_local_xlog_page_no_wait callback. */
+typedef struct ReadLocalXLogPageNoWaitPrivate
+{
+	bool end_of_wal;	/* true, when end of WAL is reached */
+} ReadLocalXLogPageNoWaitPrivate;
+
 extern XLogRedoAction XLogReadBufferForRedo(XLogReaderState *record,
 											uint8 buffer_id, Buffer *buf);
 extern Buffer XLogInitBufferForRedo(XLogReaderState *record, uint8 block_id);
-- 
2.25.1

#18

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Bharath Rupireddy (#16)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, 2022-04-27 at 13:47 +0530, Bharath Rupireddy wrote:

I found an easy way to reproduce this consistently (I think on any
server):

I basically generated huge WAL record (I used a fun extension that I
wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
use pg_logical_emit_message as well)

Thank you Bharath for creating the extension and the simple test case.

Thomas's patch solves the issue for me as well.

Tom, the debug patch you posted[0]/messages/by-id/295868.1651024073@sss.pgh.pa.us seems to be setting the error
message if it's not already set. Thomas's patch uses the lack of a
message as a signal that we've reached the end of WAL. That explains
why you are still seeing the problem.

Obviously, that's a sign that Thomas's patch is not the cleanest
solution. But other approaches would be more invasive. I guess the
question is whether that's a good enough solution for now, and
hopefully we could improve the API later; or whether we need to come up
with something better.

When reviewing, I considered the inability to read old WAL and the
inability to read flushed-in-the-middle-of-a-record WAL as similar
kinds of errors that the user would need to deal with. But they are
different: the former can be avoided by creating a slot; the latter
can't be easily avoided, only retried.

Depending on the intended use cases, forcing the user to retry might be
reasonable, in which case we could consider this a test problem rather
than a real problem, and we might be able to do something simpler to
just stabilize the test.

Regards,
Jeff Davis

[0]: /messages/by-id/295868.1651024073@sss.pgh.pa.us

#19

thomas.munro@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#17)

Re: pgsql: Add contrib/pg_walinspect.

On Wed, Apr 27, 2022 at 10:22 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Here's v2 patch (up on Thomas's v1 at [1]) using private_data to set
the end of the WAL flag. Please have a look at it.

I don't have a strong view on whether it's better to use a NULL error
for this private communication between pg_walinspect.c and the
read_page callback it installs, or install a custom private_data to
signal this condition, or to give up on all this stuff completely and
just wait (see below for thoughts). I'd feel better about both
no-wait options if the read_page callback in question were actually in
the contrib module, and not in the core code. On the other hand, I'm
not too hung up about it because I'm really hoping to see the
get-rid-of-all-these-callbacks-and-make-client-code-do-the-waiting
scheme proposed by Horiguchi-san and Heikki developed further, to rip
much of this stuff out in a future release.

If you go with the private_data approach, a couple of small comments:

     if (record == NULL)
     {
+        ReadLocalXLogPageNoWaitPrivate *private_data;
+
+        /* return NULL, if end of WAL is reached */
+        private_data = (ReadLocalXLogPageNoWaitPrivate *)
+                            xlogreader->private_data;
+
+        if (private_data->end_of_wal)
+            return NULL;
+
         if (errormsg)
             ereport(ERROR,
                     (errcode_for_file_access(),
                      errmsg("could not read WAL at %X/%X: %s",
                             LSN_FORMAT_ARGS(first_record), errormsg)));
-        else
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not read WAL at %X/%X",
-                            LSN_FORMAT_ARGS(first_record))));
     }

I think you should leave that second ereport() call in, in this
variant of the patch, no? I don't know if anything else raises errors
with no message, but if we're still going to treat them as silent
end-of-data conditions then you might as well go with the v1 patch.

Another option might be to abandon this whole no-wait concept and
revert 2258e76f's changes to xlogutils.c. pg_walinspect already does
preliminary checks that LSNs are in range, so you can't enter a value
that will wait indefinitely, and it's interruptible (it's a 1ms
sleep/check loop, not my favourite programming pattern but that's
pre-existing code). If you're unlucky enough to hit the case where
the LSN is judged to be valid but is in the middle of a record that
hasn't been totally flushed yet, it'll just be a bit slower to return
as we wait for the inevitable later flush(es) to happen. The rest of
your record will *surely* be flushed pretty soon (or the flushing
backend panics the whole system and time ends). I don't imagine this
is performance critical work, so maybe that'd be acceptable?

#20

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Thomas Munro (#19)

Re: pgsql: Add contrib/pg_walinspect.

On Thu, 2022-04-28 at 12:11 +1200, Thomas Munro wrote:

Another option might be to abandon this whole no-wait concept and
revert 2258e76f's changes to xlogutils.c. pg_walinspect already does
preliminary checks that LSNs are in range, so you can't enter a value
that will wait indefinitely, and it's interruptible (it's a 1ms
sleep/check loop, not my favourite programming pattern but that's
pre-existing code). If you're unlucky enough to hit the case where
the LSN is judged to be valid but is in the middle of a record that
hasn't been totally flushed yet, it'll just be a bit slower to return
as we wait for the inevitable later flush(es) to happen. The rest of
your record will *surely* be flushed pretty soon (or the flushing
backend panics the whole system and time ends). I don't imagine this
is performance critical work, so maybe that'd be acceptable?

I'm inclined toward this option. I was a bit iffy on those xlogutils.c
changes to begin with, and they are causing a problem now, so I'd like
to avoid layering on more workarounds.

The time when we need to wait is very narrow, only in this case where
it's earlier than the flush pointer, and the flush pointer is in the
middle of a record that's not fully flushed. And as you say, we won't
be waiting very long in that case, because once we start to write a WAL
record it better finish soon.

Bharath, thoughts? When you originally introduced the nowait behavior,
I believe that was to solve the case where someone specifies an LSN
range well in the future, but we can still catch that and throw an
error if we see that it's beyond the flush pointer. Do you see a
problem with just doing that and getting rid of the nowait changes?

Regards,
Jeff Davis

#21