Infinite loop in XLogPageRead() on standby

Started by Alexander Kukushkin · almost 2 years ago · 27 messages
#1Alexander Kukushkin
cyberdemn@gmail.com
1 attachment(s)

Hello hackers,

I spent some time debugging an issue with standby not being able to
continue streaming after failover.

The problem manifests itself with the following messages in the log:
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to "port=58669
host=/tmp/dn20WVmNqF"
LOG: restored log file "000000010000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000010000000000000003,
LSN 0/301A000, offset 106496
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3019158.
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "00000002.history" from archive
LOG: new target timeline is 2
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
FATAL: terminating walreceiver process due to administrator command
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496

The problem happens when standbys have received only the first part of a
WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to sit. As a result, the other
standby (the one that wasn't promoted) will keep trying forever to decode a
WAL record that spans multiple pages by reading the next page, which is
filled with zero bytes. This next page will never be written, because the
new primary will be writing to a new WAL file after pg_switch_wal().

Restarting the stuck standby fixes the problem, because it will first read
the history file and therefore will never read the incomplete WAL file from
the old timeline. All major versions starting from v13 are impacted
(including the master branch), because they allow changing the
primary_conninfo GUC with a reload.

Please find attached the TAP test that reproduces the problem.

To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
timeline while trying to read a page and if this page is invalid.

--
Regards,
--
Alexander Kukushkin

Attachments:

040_no_contrecord_switch.pl (application/x-perl)
#2Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#1)
Re: Infinite loop in XLogPageRead() on standby

On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:

I spent some time debugging an issue with standby not being able to
continue streaming after failover.

The problem happens when standbys have received only the first part of a
WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to sit. As a result, the other
standby (the one that wasn't promoted) will keep trying forever to decode a
WAL record that spans multiple pages by reading the next page, which is
filled with zero bytes. This next page will never be written, because the
new primary will be writing to a new WAL file after pg_switch_wal().

Wow. Have you seen that in an actual production environment?

I was just trying your TAP test to see it looping on a single record
as you mentioned:
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496

Restarting the stuck standby fixes the problem, because it will first read
the history file and therefore will never read the incomplete WAL file from
the old timeline. All major versions starting from v13 are impacted
(including the master branch), because they allow changing the
primary_conninfo GUC with a reload.

Still that's not nice at a large scale, because you would not know
about the problem until your monitoring tools raise alarms because
some nodes in your cluster setup decide to lag behind.

Please find attached the TAP test that reproduces the problem.

my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
"\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page
where the last emit_message() generated a record, and the page before
that includes the continuation record. Then abuse of WAL archiving to
force the replay of the last record. That's kind of cool.

To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
timeline while trying to read a page and if this page is invalid.

Hmm. I suspect that you may be right on a TLI change when reading a
page. There are a bunch of side cases with continuation records and
header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of patch to show your point?

Nit. In your test, it seems to me that you should not call
set_standby_mode and enable_restoring directly; just rely on has_restoring
with the standby option included.
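
For illustration, a minimal sketch of that suggestion (node and backup names
are hypothetical, not taken from the attached test): with has_restoring plus
the standby option, init_from_backup() creates standby.signal and sets
restore_command, so the explicit set_standby_mode()/enable_restoring() calls
become unnecessary.

use PostgreSQL::Test::Cluster;

# Primary with archiving enabled, so the standby can restore from the archive.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1, has_archiving => 1);
$primary->start;
$primary->backup('bkp');

# The standby is configured entirely through init_from_backup():
# has_restoring sets restore_command, standby => 1 creates standby.signal.
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
	$primary, 'bkp',
	has_streaming => 1,
	has_restoring => 1,
	standby => 1);
$standby->start;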
--
Michael

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#2)
Re: Infinite loop in XLogPageRead() on standby

At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:

I spent some time debugging an issue with standby not being able to
continue streaming after failover.

The problem happens when standbys have received only the first part of a
WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to sit. As a result, the other
standby (the one that wasn't promoted) will keep trying forever to decode a
WAL record that spans multiple pages by reading the next page, which is
filled with zero bytes. This next page will never be written, because the
new primary will be writing to a new WAL file after pg_switch_wal().

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming. A similar problem
to the issue here was seen at segment boundaries, before we introduced
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents overwriting
a WAL file that is already archived. However, in this case, the second
standby won't see the broken record because it cannot be in a
non-partial segment in the archive, and the new primary streams
END_OF_RECOVERY instead of the broken record.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Michael Paquier (#2)
Re: Infinite loop in XLogPageRead() on standby

Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael@paquier.xyz> wrote:

Wow. Have you seen that in an actual production environment?

Yes, we see it regularly, and it is reproducible in test environments as
well.

my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
"\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page
where the last emit_message() generated a record, and the page before
that includes the continuation record. Then abuse of WAL archiving to
force the replay of the last record. That's kind of cool.

Right, at this point it is easier than causing an artificial crash on the
primary after it has finished writing just one page.
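
(For context, a hedged sketch of what that artificial crash would look like
in a TAP test, assuming the $primary node object from the attached script;
the hard part, as noted, would be timing the shutdown so that only the first
page of the split record has been written and streamed.)

# Emit a message large enough to spill over a WAL page boundary, then shut
# the primary down abruptly; 'immediate' skips the shutdown checkpoint and
# looks like a crash from the standbys' point of view.
$primary->safe_psql('postgres',
	"SELECT pg_logical_emit_message(true, 'test', repeat('a', 8192))");
$primary->stop('immediate');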

To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
timeline while trying to read a page and if this page is invalid.

Hmm. I suspect that you may be right on a TLI change when reading a
page. There are a bunch of side cases with continuation records and
header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of patch to show your point?

Not yet, but hopefully I will get something done next week.

Nit. In your test, it seems to me that you should not call
set_standby_mode and enable_restoring directly; just rely on has_restoring
with the standby option included.

Thanks, I'll look into it.

--
Regards,
--
Alexander Kukushkin

#5Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Infinite loop in XLogPageRead() on standby

Hi Kyotaro,

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed
absolutely the same amount of WAL as the promoted one.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

It easily happens if the primary crashes and the standbys don't receive
the next page with the continuation of the record.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just infinitely waiting for the next
page to arrive. Also, the restart "fixes" the problem, because indeed it is
reading the file from the beginning.

But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It can easily happen that a record spanning multiple pages is not fully
received and flushed.

--
Regards,
--
Alexander Kukushkin

#6Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#5)
Re: Infinite loop in XLogPageRead() on standby

On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed
absolutely the same amount of WAL as the promoted one.

Ugh. If it can happen transparently, without the user knowing directly
about it, that does not sound good to me. I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required. It is not something that monitoring tools would be able to
detect because they just look at LSNs.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just infinitely waiting for the next
page to arrive. Also, the restart "fixes" the problem, because indeed it is
reading the file from the beginning.

What happens if the continuation record spans across multiple segment
file boundaries in this case? We would go back to the beginning of
the segment where the record spanning across multiple segments began,
right? (I may recall this part of contrecords incorrectly, feel free
to correct me if necessary.)

But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It can easily happen that a record spanning multiple pages is not fully
received and flushed.

I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
proved that this can be designed to be cheap and stable across the
buildfarm fleet.

For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than
duplicate the code. The other test has a few assumptions with the
calculation of page boundaries, and I'd rather not duplicate that
across the tree.

Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if it involves
reconnecting a standby to a promoted node where both flushed the same
LSN. Now the level of response also depends on the invasiveness of the
change, and we need a very careful evaluation here.
--
Michael

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#6)
Re: Infinite loop in XLogPageRead() on standby

At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed
absolutely the same amount of WAL as the promoted one.

Ugh. If it can happen transparently to the user without the user
knowing directly about it, that does not sound good to me. I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required. It is not something that monitoring tools would be able to
detect because they just look at LSNs.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just infinitely waiting for the next
page to arrive. Also, the restart "fixes" the problem, because indeed it is
reading the file from the beginning.

What happens if the continuation record spans across multiple segment
file boundaries in this case? We would go back to the beginning of
the segment where the record spanning across multiple segments began,
right? (I may recall this part of contrecords incorrectly, feel free
to correct me if necessary.)

After reading this, it occurred to me that the walreceiver might recover
more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If the walreceiver disconnects after a call
to WaitForWAL...(), and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
since the last call to the WaitForWAL...() function.

But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It can easily happen that a record spanning multiple pages is not fully
received and flushed.

I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
proved that this can be designed to be cheap and stable across the
buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue happened, but if my assumption above is correct, the test
doesn't manifest the real issue.

For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than
duplicate the code. The other test has a few assumptions with the
calculation of page boundaries, and I'd rather not duplicate that
across the tree.

I agree on that point.

Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if it involves
reconnecting a standby to a promoted node where both flushed the same
LSN. Now the level of response also depends on the invasiveness of the
change, and we need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: Infinite loop in XLogPageRead() on standby

At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

After reading this, it occurred to me that the walreceiver might recover
more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If the walreceiver disconnects after a call
to WaitForWAL...(), and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
since the last call to the WaitForWAL...() function.

That was my mistake. The function performs the reconnection by itself.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#8)
Re: Infinite loop in XLogPageRead() on standby

At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

After reading this, it occurred to me that the walreceiver might recover
more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If the walreceiver disconnects after a call
to WaitForWAL...(), and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
since the last call to the WaitForWAL...() function.

That was my mistake. The function performs the reconnection by itself.

Anyway, our current policy here is to avoid record-rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads unless some additional information is
available to know if replay LSN needs to back up.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#9)
Re: Infinite loop in XLogPageRead() on standby

At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Anyway, our current policy here is to avoid record-rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads unless some additional information is
available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:

* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because ReadPageInternal()
* validates the page header anyway, and would propagate the failure up to
* ReadRecord(), which would retry. However, there's a corner case with
* continuation records, if a record is split across two pages such that
* we would need to read the two pages from different sources. For

...

if (StandbyMode &&
!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{
/*
* Emit this error right now then retry this page immediately. Use
* errmsg_internal() because the message was already translated.
*/
if (xlogreader->errormsg_buf[0])
ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
(errmsg_internal("%s", xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Kyotaro Horiguchi (#10)
2 attachment(s)
Re: Infinite loop in XLogPageRead() on standby

Hello Michael, Kyotaro,

Please find attached the patch fixing the problem and the updated TAP test
that addresses Nit.

--
Regards,
--
Alexander Kukushkin

Attachments:

042_no_contrecord_switch.pl (application/x-perl)
xlogrecovery-no-contrecord.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 853b540945..f98ad350a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3301,6 +3301,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	bool		timelineChanged = false;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3425,6 +3426,16 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
+	/*
+	 * Check if the timeline changed while we were trying to read the page.
+	 * If the page is still invalid after a timeline change, we want to
+	 * return XLREAD_FAIL from the function instead of retrying, because we
+	 * might be trying to decode a record that spans multiple pages. If
+	 * another replica was promoted and switched to a new WAL segment under
+	 * this condition, the page we are trying to read might never be written.
+	 */
+	timelineChanged = xlogreader->seg.ws_tli != curFileTLI;
+
 	xlogreader->seg.ws_tli = curFileTLI;
 
 	/*
@@ -3494,8 +3505,12 @@ next_record_is_invalid:
 	readLen = 0;
 	readSource = XLOG_FROM_ANY;
 
-	/* In standby-mode, keep trying */
-	if (StandbyMode)
+	/*
+	 * In standby mode, keep trying, but only if we know that the timeline
+	 * didn't change, because after a timeline change the new primary might
+	 * have overwritten earlier parts of a record that spans multiple pages.
+	 */
+	if (StandbyMode && !timelineChanged)
 		goto retry;
 	else
 		return XLREAD_FAIL;
#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alexander Kukushkin (#11)
Re: Infinite loop in XLogPageRead() on standby

At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in

Please find attached the patch fixing the problem and the updated TAP test
that addresses Nit.

Record-level retries happen when the upper layer detects errors. In my
previous mail, I cited code that is intended to prevent this at
segment boundaries. However, the resulting code applies to all page
boundaries, since we judged that the difference doesn't significantly
affect the outcome.

* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because ReadPageInternal()
* validates the page header anyway, and would propagate the failure up to

So, the following (tentative) change should also work.

xlogrecovery.c:
@@ -3460,8 +3490,10 @@ retry:
* responsible for the validation.
*/
if (StandbyMode &&
+ targetPagePtr % 0x100000 == 0 &&
!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{

Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: Infinite loop in XLogPageRead() on standby

Hi Kyotaro,

Oh, now I understand what you mean. Is the retry supposed to happen only
when we are reading the very first page from the WAL file?

On Wed, 6 Mar 2024 at 09:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

xlogrecovery.c:
@@ -3460,8 +3490,10 @@ retry:
* responsible for the validation.
*/
if (StandbyMode &&
+               targetPagePtr % 0x100000 == 0 &&
!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
readBuf))
{

Hmm, I think you meant to use wal_segment_size, because 0x100000 is just
1MB. As a result, currently it works for you by accident.

Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

Thank you for spending your time on it!

--
Regards,
--
Alexander Kukushkin

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alexander Kukushkin (#13)
Re: Infinite loop in XLogPageRead() on standby

At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in

Hmm, I think you meant to use wal_segment_size, because 0x100000 is just
1MB. As a result, currently it works for you by accident.

Oh, I once saw the fix work, but seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

Thank you for spending your time on it!

You're welcome, but I apologize for the delay in the work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#14)
Re: Infinite loop in XLogPageRead() on standby

On Mon, Mar 11, 2024 at 04:43:32PM +0900, Kyotaro Horiguchi wrote:

At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in

Thank you for spending your time on it!

You're welcome, but I apologize for the delay in the work.

Thanks for spending time on this. Everybody is busy with the last
commit fest, and the next minor release set is planned for May so
there should still be time even after the feature freeze:
https://www.postgresql.org/developer/roadmap/

I should be able to come back to this thread around the beginning of
April. If you are able to make some progress in between, that would
surely be helpful.

Thanks,
--
Michael

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#14)
Re: Infinite loop in XLogPageRead() on standby

At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Oh, I once saw the fix work, but seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record till the end while the first standby was
promoting (but it had not been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline
increased but the first segment of the new timeline is not available
yet. In this case, the second standby successfully reads the
page-spanning record in the old timeline even after the second standby
noticed that the timeline ID has been increased, thanks to the
robustness of XLogFileReadAnyTLI().

I think the primary change to XLogPageRead that I suggested is correct
(assuming the use of wal_segment_size instead of the
constant). However, still XLogFileReadAnyTLI() has a chance to read
the segment from the old timeline after the second standby notices a
timeline switch, leading to the second issue. The second issue was
fixed by preventing XLogFileReadAnyTLI from reading segments from
older timelines than those suggested by the latest timeline
history. (In other words, disabling the "AnyTLI" part).

I recall that there was a discussion for commit 4bd0ad9e44, about the
objective of allowing reading segments from older timelines than the
timeline history suggests. In my faint memory, we concluded to
postpone the decision to remove the feature due to uncertainty
about the objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
XLogFileReadOnTLHistory(), which reads segments that align precisely
with the timeline history.

Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
redirection to the first standby, it would need pg_rewind to connect
to the first standby.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: Infinite loop in XLogPageRead() on standby

Hi Kyotaro,

On Wed, 13 Mar 2024 at 03:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record till the end while the first standby was
promoting (but it had not been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline
increased but the first segment of the new timeline is not available
yet. In this case, the second standby successfully reads the
page-spanning record in the old timeline even after the second standby
noticed that the timeline ID has been increased, thanks to the
robustness of XLogFileReadAnyTLI().

Hmm, I don't think it can really be prevented.
There is always a chance that a standby that is not ahead of the other
standbys gets promoted, for reasons like:
1. The HA configuration doesn't let certain nodes be promoted.
2. This is an async standby (its name isn't listed in
synchronous_standby_names) and it was ahead of the promoted sync standby.
No data loss from the client point of view.

Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
to the first standby.

Correct, IMO pg_rewind is the right way of solving it.
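
For reference, a minimal sketch of that pg_rewind step in TAP-test terms
(node names are hypothetical, and the usual pg_rewind prerequisites such as
wal_log_hints or data checksums are assumed):

use PostgreSQL::Test::Utils;

# The stuck standby has to be shut down cleanly before it can be rewound
# onto the promoted node.
$old_standby->stop;
command_ok(
	[
		'pg_rewind', '--progress',
		'--write-recovery-conf',
		'--source-server', $promoted->connstr('postgres'),
		'--target-pgdata', $old_standby->data_dir,
	],
	'rewind the stuck standby onto the promoted node');
$old_standby->start;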

Regards,
--
Alexander Kukushkin

#18Ants Aasma
ants.aasma@cybertec.at
In reply to: Kyotaro Horiguchi (#16)
Re: Infinite loop in XLogPageRead() on standby

On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Oh, I once saw the fix work, but seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record till the end while the first standby was
promoting (but it had not been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline
increased but the first segment of the new timeline is not available
yet. In this case, the second standby successfully reads the
page-spanning record in the old timeline even after the second standby
noticed that the timeline ID has been increased, thanks to the
robustness of XLogFileReadAnyTLI().

I think the primary change to XLogPageRead that I suggested is correct
(assuming the use of wal_segment_size instead of the
constant). However, still XLogFileReadAnyTLI() has a chance to read
the segment from the old timeline after the second standby notices a
timeline switch, leading to the second issue. The second issue was
fixed by preventing XLogFileReadAnyTLI from reading segments from
older timelines than those suggested by the latest timeline
history. (In other words, disabling the "AnyTLI" part).

I recall that there was a discussion for commit 4bd0ad9e44, about the
objective of allowing reading segments from older timelines than the
timeline history suggests. In my faint memory, we concluded to
postpone the decision to remove the feature due to uncertainty
about the objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
XLogFileReadOnTLHistory(), which reads segments that align precisely
with the timeline history.

This sounds very similar to the problem described in [1]. And I think
both will be resolved by that change.

[1]: /messages/by-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

#19Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Ants Aasma (#18)
Re: Infinite loop in XLogPageRead() on standby

Hi Michael and Kyotaro,

Now that beta1 has been released I hope you are not too busy, and hence I
would like to follow up on this problem.

Regards,
--
Alexander Kukushkin

#20Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#19)
Re: Infinite loop in XLogPageRead() on standby

On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote:

Now that beta1 has been released I hope you are not too busy, and hence I
would like to follow up on this problem.

I am still working on something for the v18 cycle that I'd like to
present before the beginning of the next commit fest, so I am a bit
busy getting that out first. Fingers crossed not to have any open items
to look at. This thread is one of the things I have marked as an item
to look at, yes.
--
Michael

#21Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Michael Paquier (#20)
Re: Infinite loop in XLogPageRead() on standby

Hi Michael,

Now that v17 is released and we have a few months before the v18 feature
freeze, I hope you will find some time to look at it.

On Wed, 5 Jun 2024 at 07:09, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote:

Now that beta1 has been released I hope you are not too busy, and hence I
would like to follow up on this problem.

I am still working on something for the v18 cycle that I'd like to
present before the beginning of the next commit fest, so I am a bit
busy getting that out first. Fingers crossed not to have any open items
to look at. This thread is one of the things I have marked as an item
to look at, yes.
--
Michael

Regards,
--
Alexander Kukushkin

#22Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#21)
Re: Infinite loop in XLogPageRead() on standby

On Wed, Nov 13, 2024 at 02:18:06PM +0100, Alexander Kukushkin wrote:

Now that v17 is released and we have a few months before the v18 feature
freeze, I hope you will find some time to look at it.

My apologies for taking a couple of weeks before coming back to this
thread. I was informed a couple of days ago about a report showing the
same symptoms as what you have reported. Please note that we have two
other threads which refer to the same problem as the one you have here:
/messages/by-id/15567-a9ff747c5e7c031c@postgresql.org
/messages/by-id/CAJ+DhQaf9ZMkqPPnapu06mvMFe-b_K4NGF+-MV+e=JB2edF4cQ@mail.gmail.com

All of them refer to an infinite loop reachable in the startup process
when we read an incorrect, incomplete record just after a failover or
when a WAL receiver restarts. I am not sure which way is best to fix
all of them in one shot; I am still considering the possible options
here.
--
Michael

#23Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#10)
Re: Infinite loop in XLogPageRead() on standby

On Fri, Mar 01, 2024 at 01:16:37PM +0900, Kyotaro Horiguchi wrote:

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

I don't immediately see why we should not be able to backpatch a fix
that restricts the early page header check to happen only at a segment
boundary when we are in standby mode. Perhaps I'm just missing your
point.

If we cannot use that, wouldn't it be better just to stick with the
proposed check on the TLI and force an error in the page read callback
if the situation has changed due to a TLI jump with curFileTLI, so that
we don't loop and infinitely check these incorrect pages?
--
Michael

#24Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#22)
2 attachment(s)
Re: Infinite loop in XLogPageRead() on standby

On Wed, Dec 25, 2024 at 12:00:59PM +0900, Michael Paquier wrote:

All of them refer to an infinite loop reachable in the startup process
when we read an incorrect, incomplete record just after a failover or
when a WAL receiver restarts. I am not sure which way is best to fix
all of them in one shot; I am still considering the possible options
here.

And please find attached a set of two patches addressing the issue.
I've reviewed the thread again, and after pondering quite a bit about
it I am on board with the point of making the early header check in
the page read callback less generic by restricting it to a WAL segment
boundary, as that's the original case we care about; it also maps to
the fact that WAL receivers are spawned with an LSN starting at the
beginning of a segment, as suggested by Horiguchi-san. As far as I can
see, I misunderstood things a bit last week about the backpatch
point, sorry about that.

The new regression test is something I really want to keep around,
to be able to emulate the infinite loop, but I got annoyed with the
amount of duplication between the new test and the existing
039_end_of_wal.pl, as they share the same ideas. I have refactored
039_end_of_wal.pl and moved its routines for generating WAL messages,
advancing WAL, and writing WAL into Cluster.pm, then reused the
refactored result in the new test. I've also detected a race
condition in the new test, where we should wait for the logs of the
two standbys to report their "invalid magic number" failures. The
test could fail when scanning their logs if the replay is slow. There
is wait_for_log() for this job, available down to v13.
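
A minimal sketch of that synchronization, assuming a $standby node and
variable names along the lines of the attached test (this is not the
committed code): capture the log size first, then poll with wait_for_log()
instead of racing against a possibly slow replay.

# Remember where the log ends before triggering the replay of the broken
# segment, so that only new messages are matched.
my $log_offset = -s $standby->logfile;

# ... make the zero-filled segment available and let the standby retry ...

# Block until the standby has actually reported the failure.
$standby->wait_for_log(
	qr/invalid magic number 0000 in WAL segment/, $log_offset);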

Barring objections, I'd like to get the main issue in 0002 fixed at
the beginning of next week. I am also planning to do the refactoring
bits tomorrow or so as these are rather straight-forward. The names
of the new routines in Cluster.pm are inherited from the existing
recovery test. Perhaps they could use a better name, but 0001 does
not look that bad to me either.
--
Michael

Attachments:

0001-Move-WAL-write-and-advance-routines-into-PostgreSQL-.patch (text/x-diff)
From 4f25575aafdb8695e9884310df4af4fcb7ee256f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 15 Jan 2025 12:23:53 +0900
Subject: [PATCH 1/2] Move WAL write and advance routines into
 PostgreSQL::Test::Cluster

These facilities were originally in 039_end_of_wal.pl, and a follow-up
bug fix with a TAP test doing similar WAL manipulations requires them.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 146 +++++++++++++
 src/test/recovery/t/039_end_of_wal.pl    | 257 ++++++-----------------
 2 files changed, 212 insertions(+), 191 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 08b89a4cdf..30b7ab515d 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2954,6 +2954,152 @@ sub lsn
 
 =pod
 
+=item $node->write_wal($tli, $lsn, $segment_size, $data)
+
+Write some arbitrary data in WAL for the given segment at $lsn (in bytes).
+This should be called while the cluster is not running.
+
+Returns the path of the WAL segment written to.
+
+=cut
+
+sub write_wal
+{
+	my ($self, $tli, $lsn, $segment_size, $data) = @_;
+
+	# Calculate segment number and offset position in segment based on the
+	# input LSN.
+	my $segment = $lsn / $segment_size;
+	my $offset = $lsn % $segment_size;
+	my $path =
+	  sprintf("%s/pg_wal/%08X%08X%08X", $self->data_dir, $tli, 0, $segment);
+
+	open my $fh, "+<:raw", $path or die "could not open WAL segment $path";
+	seek($fh, $offset, SEEK_SET) or die "could not seek WAL segment $path";
+	print $fh $data;
+	close $fh;
+
+	return $path;
+}
+
+=pod
+
+=item $node->emit_message($size)
+
+Emit a WAL record of arbitrary size.
+
+Returns the end LSN of the record inserted, in bytes.
+
+=cut
+
+sub emit_message
+{
+	my ($self, $size) = @_;
+
+	return int(
+		$self->safe_psql(
+			'postgres',
+			"SELECT pg_logical_emit_message(true, '', repeat('a', $size)) - '0/0'"
+		));
+}
+
+
+# Private routine returning the current insert LSN of a node, in bytes.
+# Used by routines in charge of advancing WAL.
+sub _get_insert_lsn
+{
+	my ($self) = @_;
+	return int(
+		$self->safe_psql(
+			'postgres', "SELECT pg_current_wal_insert_lsn() - '0/0'"));
+}
+
+=pod
+
+=item $node->advance_wal_out_of_record_splitting_zone($wal_block_size)
+
+This inserts a few records of a fixed size, until the threshold gets close
+enough to the end of the WAL page inserting records to.  Make sure we are
+far away enough from the end of a page that we could insert a couple of
+small records.
+
+Returns the end LSN up to which WAL has advanced.
+
+=cut
+
+sub advance_wal_out_of_record_splitting_zone
+{
+	my ($self, $wal_block_size) = @_;
+
+	my $page_threshold = $wal_block_size / 4;
+	my $end_lsn = $self->_get_insert_lsn();
+	my $page_offset = $end_lsn % $wal_block_size;
+	while ($page_offset >= $wal_block_size - $page_threshold)
+	{
+		$self->emit_message($page_threshold);
+		$end_lsn = $self->_get_insert_lsn();
+		$page_offset = $end_lsn % $wal_block_size;
+	}
+	return $end_lsn;
+}
+
+=pod
+
+=item $node->advance_wal_to_record_splitting_zone($wal_block_size)
+
+Advance WAL so close to the end of a page that an XLogRecordHeader would not
+fit on it.
+
+Returns the end LSN up to which WAL has advanced.
+
+=cut
+
+sub advance_wal_to_record_splitting_zone
+{
+	my ($self, $wal_block_size) = @_;
+
+	# Size of record header.
+	my $RECORD_HEADER_SIZE = 24;
+
+	my $end_lsn = $self->_get_insert_lsn();
+	my $page_offset = $end_lsn % $wal_block_size;
+
+	# Get fairly close to the end of a page in big steps
+	while ($page_offset <= $wal_block_size - 512)
+	{
+		$self->emit_message($wal_block_size - $page_offset - 256);
+		$end_lsn = $self->_get_insert_lsn();
+		$page_offset = $end_lsn % $wal_block_size;
+	}
+
+	# Calibrate our message size so that we can get closer 8 bytes at
+	# a time.
+	my $message_size = $wal_block_size - 80;
+	while ($page_offset <= $wal_block_size - $RECORD_HEADER_SIZE)
+	{
+		$self->emit_message($message_size);
+		$end_lsn = $self->_get_insert_lsn();
+
+		my $old_offset = $page_offset;
+		$page_offset = $end_lsn % $wal_block_size;
+
+		# Adjust the message size until it causes 8 bytes changes in
+		# offset, enough to be able to split a record header.
+		my $delta = $page_offset - $old_offset;
+		if ($delta > 8)
+		{
+			$message_size -= 8;
+		}
+		elsif ($delta <= 0)
+		{
+			$message_size += 8;
+		}
+	}
+	return $end_lsn;
+}
+
+=pod
+
 =item $node->check_extension(extension_name)
 
 Scan pg_available_extensions to check that an extension is available in an
diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index ab751eb271..9c8d5290ac 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -20,9 +20,6 @@ use integer;    # causes / operator to use integer math
 # we need to know the endianness to do that.
 my $BIG_ENDIAN = pack("L", 0x12345678) eq pack("N", 0x12345678);
 
-# Header size of record header.
-my $RECORD_HEADER_SIZE = 24;
-
 # Fields retrieved from code headers.
 my @scan_result = scan_server_header('access/xlog_internal.h',
 	'#define\s+XLOG_PAGE_MAGIC\s+(\w+)');
@@ -36,64 +33,6 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
-# Build path of a WAL segment.
-sub wal_segment_path
-{
-	my $node = shift;
-	my $tli = shift;
-	my $segment = shift;
-	my $wal_path =
-	  sprintf("%s/pg_wal/%08X%08X%08X", $node->data_dir, $tli, 0, $segment);
-	return $wal_path;
-}
-
-# Calculate from a LSN (in bytes) its segment number and its offset.
-sub lsn_to_segment_and_offset
-{
-	my $lsn = shift;
-	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
-}
-
-# Write some arbitrary data in WAL for the given segment at LSN.
-# This should be called while the cluster is not running.
-sub write_wal
-{
-	my $node = shift;
-	my $tli = shift;
-	my $lsn = shift;
-	my $data = shift;
-
-	my ($segment, $offset) = lsn_to_segment_and_offset($lsn);
-	my $path = wal_segment_path($node, $tli, $segment);
-
-	open my $fh, "+<:raw", $path or die;
-	seek($fh, $offset, SEEK_SET) or die;
-	print $fh $data;
-	close $fh;
-}
-
-# Emit a WAL record of arbitrary size.  Returns the end LSN of the
-# record inserted, in bytes.
-sub emit_message
-{
-	my $node = shift;
-	my $size = shift;
-	return int(
-		$node->safe_psql(
-			'postgres',
-			"SELECT pg_logical_emit_message(true, '', repeat('a', $size)) - '0/0'"
-		));
-}
-
-# Get the current insert LSN of a node, in bytes.
-sub get_insert_lsn
-{
-	my $node = shift;
-	return int(
-		$node->safe_psql(
-			'postgres', "SELECT pg_current_wal_insert_lsn() - '0/0'"));
-}
-
 # Get GUC value, converted to an int.
 sub get_int_setting
 {
@@ -167,69 +106,6 @@ sub build_page_header
 		$BIG_ENDIAN ? $xlp_pageaddr : 0, $xlp_rem_len);
 }
 
-# Make sure we are far away enough from the end of a page that we could insert
-# a couple of small records.  This inserts a few records of a fixed size, until
-# the threshold gets close enough to the end of the WAL page inserting records
-# to.
-sub advance_out_of_record_splitting_zone
-{
-	my $node = shift;
-
-	my $page_threshold = $WAL_BLOCK_SIZE / 4;
-	my $end_lsn = get_insert_lsn($node);
-	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
-	{
-		emit_message($node, $page_threshold);
-		$end_lsn = get_insert_lsn($node);
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	}
-	return $end_lsn;
-}
-
-# Advance so close to the end of a page that an XLogRecordHeader would not
-# fit on it.
-sub advance_to_record_splitting_zone
-{
-	my $node = shift;
-
-	my $end_lsn = get_insert_lsn($node);
-	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-
-	# Get fairly close to the end of a page in big steps
-	while ($page_offset <= $WAL_BLOCK_SIZE - 512)
-	{
-		emit_message($node, $WAL_BLOCK_SIZE - $page_offset - 256);
-		$end_lsn = get_insert_lsn($node);
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	}
-
-	# Calibrate our message size so that we can get closer 8 bytes at
-	# a time.
-	my $message_size = $WAL_BLOCK_SIZE - 80;
-	while ($page_offset <= $WAL_BLOCK_SIZE - $RECORD_HEADER_SIZE)
-	{
-		emit_message($node, $message_size);
-		$end_lsn = get_insert_lsn($node);
-
-		my $old_offset = $page_offset;
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-
-		# Adjust the message size until it causes 8 bytes changes in
-		# offset, enough to be able to split a record header.
-		my $delta = $page_offset - $old_offset;
-		if ($delta > 8)
-		{
-			$message_size -= 8;
-		}
-		elsif ($delta <= 0)
-		{
-			$message_size += 8;
-		}
-	}
-	return $end_lsn;
-}
-
 # Setup a new node.  The configuration chosen here minimizes the number
 # of arbitrary records that could get generated in a cluster.  Enlarging
 # checkpoint_timeout avoids noise with checkpoint activity.  wal_level
@@ -265,8 +141,8 @@ note "Single-page end-of-WAL detection";
 ###########################################################################
 
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
@@ -276,10 +152,10 @@ ok( $node->log_contains(
 	"xl_tot_len zero");
 
 # xl_tot_len is < 24 (presumably recycled garbage).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn, build_record_header(23));
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE, build_record_header(23));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -289,10 +165,10 @@ ok( $node->log_contains(
 
 # xl_tot_len in final position, not big enough to span into a new page but
 # also not eligible for regular record header validation
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn, build_record_header(1));
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE, build_record_header(1));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -301,10 +177,10 @@ ok( $node->log_contains(
 	"xl_tot_len short at end-of-page");
 
 # Need more pages, but xl_prev check fails first.
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
@@ -313,12 +189,12 @@ ok( $node->log_contains(
 	"xl_prev bad");
 
 # xl_crc check fails.
-emit_message($node, 0);
-advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 10);
+$node->emit_message(0);
+$node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(10);
 $node->stop('immediate');
 # Corrupt a byte in that record, breaking its CRC.
-write_wal($node, $TLI, $end_lsn - 8, '!');
+$node->write_wal($TLI, $end_lsn - 8, $WAL_SEGMENT_SIZE, '!');
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -335,11 +211,11 @@ note "Multi-page end-of-WAL detection, header is not split";
 # written to WAL.
 
 # Good xl_prev, we hit zero page next (zero magic).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
 $log_size = -s $node->logfile;
 $node->start;
@@ -347,16 +223,14 @@ ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
 	"xlp_magic zero");
 
 # Good xl_prev, we hit garbage page next (bad magic).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
-	build_page_header(0xcafe, 0, 1, 0));
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE, build_page_header(0xcafe, 0, 1, 0));
 $log_size = -s $node->logfile;
 $node->start;
 ok($node->log_contains("invalid magic number CAFE .* LSN .*", $log_size),
@@ -364,16 +238,14 @@ ok($node->log_contains("invalid magic number CAFE .* LSN .*", $log_size),
 
 # Good xl_prev, we hit typical recycled page (good xlp_magic, bad
 # xlp_pageaddr).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
-	build_page_header($XLP_PAGE_MAGIC, 0, 1, 0xbaaaaaad));
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE, build_page_header($XLP_PAGE_MAGIC, 0, 1, 0xbaaaaaad));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -381,15 +253,16 @@ ok( $node->log_contains(
 	"xlp_pageaddr bad");
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, but bogus xlp_info.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, 0x1234, 1, start_of_next_page($end_lsn)));
 $log_size = -s $node->logfile;
@@ -399,15 +272,14 @@ ok($node->log_contains("invalid info bits 1234 in .*, LSN .*,", $log_size),
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, but xlp_info doesn't mention
 # continuation record.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header($XLP_PAGE_MAGIC, 0, 1, start_of_next_page($end_lsn)));
 $log_size = -s $node->logfile;
 $node->start;
@@ -416,15 +288,16 @@ ok($node->log_contains("there is no contrecord flag at .*", $log_size),
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, xlp_info but xlp_rem_len doesn't add
 # up.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD,
 		1, start_of_next_page($end_lsn),
@@ -441,10 +314,10 @@ note "Multi-page, but header is split, so page checks are done first";
 ###########################################################################
 
 # xl_prev is bad and xl_tot_len is too big, but we'll check xlp_magic first.
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
@@ -452,14 +325,15 @@ ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
 	"xlp_magic zero (split record header)");
 
 # And we'll also check xlp_pageaddr before any header checks.
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD, 1, 0xbaaaaaad));
 $log_size = -s $node->logfile;
@@ -470,14 +344,15 @@ ok( $node->log_contains(
 
 # We'll also discover that xlp_rem_len doesn't add up before any
 # header checks,
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD,
 		1, start_of_next_page($end_lsn),
-- 
2.47.1

0002-Fix-incorrect-header-check-for-continuation-WAL-reco.patch (text/x-diff; charset=us-ascii)
From d5fab0db45577f5b6a77b1119847fa9832b174c9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 15 Jan 2025 13:26:47 +0900
Subject: [PATCH 2/2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea20..cf2b007806 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3438,12 +3438,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3465,6 +3465,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 56c464abb7..0428704dbf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..b5983a6c62
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_message(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_message($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

#25Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Michael Paquier (#24)
Re: Infinite loop in XLogPageRead() on standby

Hi Michael,

On Wed, 15 Jan 2025 at 05:45, Michael Paquier <michael@paquier.xyz> wrote:

The new regression test is something I really want to keep around,
to be able to emulate the infinite loop, but I got annoyed with the
amount of duplication between the new test and the existing
039_end_of_wal.pl, as they share the same ideas. I have refactored
039_end_of_wal.pl and moved its routines that generate WAL messages,
advance WAL and write WAL into Cluster.pm, then reused the
refactored result in the new test.

Barring objections, I'd like to get the main issue in 0002 fixed at
the beginning of next week. I am also planning to do the refactoring
bits tomorrow or so, as these are rather straightforward. The names
of the new routines in Cluster.pm are inherited from the existing
recovery test. Perhaps they could use better names, but 0001 does
not look that bad to me either.

Thank you for picking it up. I briefly looked at both patches. The actual
fix in XLogPageRead() looks good to me.
I also agree with the suggested refactoring, where there is certainly some
room for improvement: the $WAL_SEGMENT_SIZE and $WAL_BLOCK_SIZE variables
and the get_int_setting() and start_of_page() functions are still
duplicated in both test files.
Maybe we can have something like the following in Cluster.pm? It would
allow us to further simplify the tests and reduce code duplication.

sub get_timeline_id
{
	my $self = shift;
	return $self->safe_psql('postgres',
		"SELECT timeline_id FROM pg_control_checkpoint()");
}

sub get_int_setting
{
	my ($self, $name) = @_;
	return int(
		$self->safe_psql(
			'postgres',
			"SELECT setting FROM pg_settings WHERE name = '$name'"));
}

sub WAL_SEGMENT_SIZE
{
	my $self = shift;
	$self->{_WAL_SEGMENT_SIZE} = $self->get_int_setting('wal_segment_size')
	  unless defined $self->{_WAL_SEGMENT_SIZE};
	return $self->{_WAL_SEGMENT_SIZE};
}

sub WAL_BLOCK_SIZE
{
	my $self = shift;
	$self->{_WAL_BLOCK_SIZE} = $self->get_int_setting('wal_block_size')
	  unless defined $self->{_WAL_BLOCK_SIZE};
	return $self->{_WAL_BLOCK_SIZE};
}

sub start_of_page
{
	my ($self, $lsn) = @_;
	return $lsn & ~($self->WAL_BLOCK_SIZE - 1);
}
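
For example, with such accessors the setup of the new test could be
reduced to something along these lines (an untested sketch that reuses
the routine names proposed above, just to show the idea):

my $WAL_SEGMENT_SIZE = $primary->WAL_SEGMENT_SIZE;
my $WAL_BLOCK_SIZE = $primary->WAL_BLOCK_SIZE;
my $TLI = $primary->get_timeline_id;

# ... emit the record spanning two pages, as in the patch ...

# Zero out the page holding the continuation record.
my $start_page = $primary->start_of_page($end_lsn);
my $wal_file = $primary->write_wal($TLI, $start_page,
	$WAL_SEGMENT_SIZE, "\x00" x $WAL_BLOCK_SIZE);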

Regards,
--
Alexander Kukushkin

#26Michael Paquier
michael@paquier.xyz
In reply to: Alexander Kukushkin (#25)
6 attachment(s)
Re: Infinite loop in XLogPageRead() on standby

On Wed, Jan 15, 2025 at 10:35:42AM +0100, Alexander Kukushkin wrote:

Thank you for picking it up. I briefly looked at both patches. The actual
fix in XLogPageRead() looks good to me.
I also agree with suggested refactoring, where there is certainly some room
for improvement - $WAL_SEGMENT_SIZE, $WAL_BLOCK_SIZE variables and
get_int_setting(), start_of_page() funcs are still duplicated in both test
files.
Maybe we can have something like the following in Cluster.pm? It'll allow
further simplify tests and reduce code duplication.

Yeah, I was looking at that, but I disliked this option a bit compared
to the routines doing the WAL manipulations, which are more complex
with their own ways of getting close to page limits, because I feel
that we should be smarter with the interfaces of these routines and
offer more options. For example, we have other things scanning control
file data, like system_identifier in 040_pg_createsubscriber.pl, so we
could have a SQL wrapper around the pg_control_* functions that takes
a custom field, like we do for "sub lsn" in Cluster.pm. Same comment
for the pg_settings queries, which are not limited to only what we do
here. The wrapper around integers is useful, of course, but we could
make a refactored routine apply a cast after checking the setting's
unit internally, or something like that.
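
To be clearer about the kind of interface I have in mind, here is a
rough and untested sketch for Cluster.pm (routine names invented here,
not part of the patches attached):

# Grab a single field from one of the pg_control_* SQL functions, for
# example $node->control_data_field('pg_control_checkpoint',
# 'timeline_id').
sub control_data_field
{
	my ($self, $function, $field) = @_;
	return $self->safe_psql('postgres', "SELECT $field FROM $function()");
}

# Grab a GUC from pg_settings and cast it to an integer.  A smarter
# version could look at the unit column before applying the cast.
sub setting_as_int
{
	my ($self, $name) = @_;
	return int(
		$self->safe_psql(
			'postgres',
			"SELECT setting FROM pg_settings WHERE name = '$name'"));
}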

The routines for the start page position would not fit within
Cluster.pm, as they do not depend on a $node. Perhaps Utils.pm? Here
again, living with this small duplication felt OK in the scope of this
bug fix. I'm of course open to tuning all that, though my primary
goal is to wrap up the fix :D
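
If we went the Utils.pm route, the node-independent helpers could be
as simple as the following (also untested, and out of the scope of
this fix), taking the block size as an argument:

sub start_of_page
{
	my ($lsn, $wal_block_size) = @_;
	return $lsn & ~($wal_block_size - 1);
}

sub start_of_next_page
{
	my ($lsn, $wal_block_size) = @_;
	return start_of_page($lsn, $wal_block_size) + $wal_block_size;
}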

I've applied the first refactoring bits down to v13 (see for example
the s/emit_message/emit_wal/ rename, done for consistency, along with
more comment tweaks). Attached are patches for each branch for the
bug fix, which I'm still testing and looking at more. The readability
of 043_no_contrecord_switch.pl looks rather OK here.
--
Michael

Attachments:

v2-0002-Fix-incorrect-header-check-for-continuatio-master.patch (text/x-diff; charset=us-ascii)
From 8167332c82691ff9f28286705b501ac5f3b4794f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea20..cf2b007806 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3438,12 +3438,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3465,6 +3465,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 56c464abb7..0428704dbf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

v2-0002-Fix-incorrect-header-check-for-continuation-W-v17.patch (text/x-diff; charset=us-ascii)
From a58c2c2d4a5cdd2f5c5cb1d2a8b301b7bddb2df7 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b45b833172..a94d4b4b78 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3436,12 +3436,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3463,6 +3463,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..7623cb1fe6 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

v2-0002-Fix-incorrect-header-check-for-continuation-W-v16.patch (text/x-diff; charset=us-ascii)
From 1bbffa37813cfc39ba323d255ee07ea9f87ac189 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3c7fb913e7..6d48cb7e84 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3386,12 +3386,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3413,6 +3413,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index f5c1021a5e..24d0f5b0c1 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -45,6 +45,7 @@ tests += {
       't/037_invalid_database.pl',
       't/039_end_of_wal.pl',
       't/043_vacuum_horizon_floor.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

v2-0002-Fix-incorrect-header-check-for-continuation-W-v15.patch (text/x-diff; charset=us-ascii)
From 234a62d98a8205b6e1e3fe0c2f9a7e61cf08678d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 160 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1503b21671..bbc19df192 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3361,12 +3361,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3388,6 +3388,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

v2-0002-Fix-incorrect-header-check-for-continuation-W-v14.patch (text/x-diff; charset=us-ascii)
From 7422ad5a702773867fbab7d3becaeca1f460488c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:43:13 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans across multiple pages over multiple segments, as when WAL
receivers are spawned they request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlog.c             |  24 +--
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 165 insertions(+), 12 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e2cb88f3..0334c2d3bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12562,18 +12562,17 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
-	 * recycled WAL segment present in pg_wal, with garbage contents, however.
-	 * We would read the first page from the local WAL segment, but when
-	 * reading the second page, we would read the bogus, recycled, WAL
-	 * segment. If we didn't catch that case here, we would never recover,
-	 * because ReadRecord() would retry reading the whole record from the
-	 * beginning.
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
+	 * WAL segment present in pg_wal, with garbage contents, however. We would
+	 * read the first page from the local WAL segment, but when reading the
+	 * second page, we would read the bogus, recycled, WAL segment. If we
+	 * didn't catch that case here, we would never recover, because
+	 * ReadRecord() would retry reading the whole record from the beginning.
 	 *
 	 * Of course, this only catches errors in the page header, which is what
 	 * happens in the case of a recycled WAL segment. Other kinds of errors or
@@ -12589,6 +12588,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1
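
For readers following the condition added above without the tree at hand,
here is a minimal standalone sketch (plain C, not PostgreSQL code) of the
arithmetic behind the new check.  The segment and page sizes, the helper
name pre_validate_header(), and the two sample positions are assumptions
made for illustration only; only the modulo test mirrors the patch.

#include <stdio.h>
#include <stdint.h>

/* Sizes assumed for the illustration; they match the common defaults. */
#define WAL_SEGMENT_SIZE	(16 * 1024 * 1024)	/* 16MB segments */
#define XLOG_BLCKSZ			8192				/* 8kB WAL pages */

/*
 * Mirrors the patched condition: in standby mode, pre-validate the page
 * header only when the target page sits at the start of a segment, which
 * is where a freshly spawned WAL receiver begins streaming.
 */
static int
pre_validate_header(uint64_t targetPagePtr, int standby_mode)
{
	return standby_mode && (targetPagePtr % WAL_SEGMENT_SIZE) == 0;
}

int
main(void)
{
	/* Two made-up positions: a segment boundary and a page inside one. */
	uint64_t	lsns[] = {0x4000000UL, 0x4008000UL};

	for (int i = 0; i < 2; i++)
	{
		uint64_t	page = lsns[i] - (lsns[i] % XLOG_BLCKSZ);

		printf("page %X/%08X: pre-validate header = %s\n",
			   (unsigned int) (page >> 32), (unsigned int) page,
			   pre_validate_header(page, 1) ? "yes" : "no");
	}
	return 0;
}

With the old coding (StandbyMode alone), both positions would have gone
through the early header validation; with the patch, only the segment
boundary does, and pages in the middle of a segment are left to the
generic validation done by the xlogreader.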

v2-0002-Fix-incorrect-header-check-for-continuation-W-v13.patch (text/x-diff; charset=us-ascii)
From 1bc8a5c47c969321f5882e5e759e94427e068901 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 08:43:13 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records on
 standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read a LSN at the
boundary of a segment, to handle the case of a continuation record that
spans multiple pages over multiple segments, since WAL receivers, when
spawned, request WAL from the beginning of a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test able to reproduce the problem is added; it overwrites
the contents of a continuation record with junk.  This is inspired by
039_end_of_wal.pl.
---
 src/backend/access/transam/xlog.c             |  13 +-
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 160 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3277a1cbe6..769f4a202f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12336,12 +12336,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled in the master. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the master. There is a recycled
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * WAL segment present in pg_wal, with garbage contents, however. We would
 	 * read the first page from the local WAL segment, but when reading the
 	 * second page, we would read the bogus, recycled, WAL segment. If we
@@ -12362,6 +12362,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1

#27 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#26)
Re: Infinite loop in XLogPageRead() on standby

On Thu, Jan 16, 2025 at 09:42:49AM +0900, Michael Paquier wrote:

> I've applied the first refactoring bits down to v13 (see for example an
> s/emit_message/emit_wal/ tweak for consistency, with more comment
> tweaks).  Attached are patches for each branch for the bug fix, which
> I'm still testing and looking at more.  The readability of
> 043_no_contrecord_switch.pl looks rather OK here.

And done this as well. The buildfarm looks OK with it for the moment.
I'll keep an eye on things, including the CF bot.
--
Michael