Assertion failure with summarize_wal enabled during pg_createsubscriber

Started by Fujii Masaoover 1 year ago13 messages

masao.fujii@oss.nttdata.com

over 1 year ago

Hi,

In HEAD, I encountered the following assertion failure when I enabled summarize_wal
and ran pg_createsubscriber.

2024-07-01 14:42:15.697 JST [19195] LOG: database system is ready to accept connections
TRAP: failed Assert("switchpoint >= state->EndRecPtr"), File: "walsummarizer.c", Line: 1382, PID: 19200
0 postgres 0x0000000105c46c5d ExceptionalCondition + 189
1 postgres 0x000000010590b1e4 summarizer_read_local_xlog_page + 340
2 postgres 0x00000001054e401e ReadPageInternal + 542
3 postgres 0x00000001054e24c0 XLogDecodeNextRecord + 464
4 postgres 0x00000001054e2283 XLogReadAhead + 67
5 postgres 0x00000001054e2185 XLogReadRecord + 53
6 postgres 0x000000010590a3ab SummarizeWAL + 1115
7 postgres 0x000000010590963a WalSummarizerMain + 1242
8 postgres 0x00000001058fd10a postmaster_child_launch + 234
9 postgres 0x000000010590133d StartChildProcess + 29
10 postgres 0x0000000105904582 MaybeStartWalSummarizer + 82
11 postgres 0x0000000105901af1 ServerLoop + 1153
12 postgres 0x00000001059007ca PostmasterMain + 6554
13 postgres 0x00000001057a3782 main + 818
14 dyld 0x00007ff80e5e2366 start + 1942
2024-07-01 14:42:15.912 JST [19195] LOG: WAL summarizer process (PID 19200) was terminated by signal 6: Abort trap: 6
2024-07-01 14:42:15.913 JST [19195] LOG: terminating any other active server processes

Here are the steps to reproduce this issue.

--------------------------------
initdb -D pub
cat <<EOF >> pub/postgresql.conf
wal_level = 'logical'
summarize_wal = on
EOF
pg_ctl -D pub start
pgbench -i
pgbench -T 600 &
pg_basebackup -D sub -c fast -R
pg_createsubscriber -d postgres -D sub -p 5433 -P "port=5432"
--------------------------------

Is this the known issue?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Fujii Masao (#1)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Mon, Jul 01, 2024 at 02:54:56PM +0900, Fujii Masao wrote:

In HEAD, I encountered the following assertion failure when I enabled summarize_wal
and ran pg_createsubscriber.

Is this the known issue?

Nope. So, Open Item, here we go.
--
Michael

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Michael Paquier (#2)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Mon, Jul 1, 2024 at 2:08 AM Michael Paquier <michael@paquier.xyz> wrote:

Nope. So, Open Item, here we go.

Some initial investigation:

In this test case, pg_subscriber, during the "starting the subscriber"
section of its work, starts up the database in the "sub" directory as
a standby. It enters standby mode, begins redo, and is then promoted,
selecting timeline 2. The WAL summarizer is supposed to end
summarization at the point at which timeline 1 ended and then resume
summarizing from the beginning of timeline 2. But instead, it fails an
assertion:

Assert(switchpoint >= state->EndRecPtr);

This assertion is trying to verify that, when a new timeline is
spawned, we don't read past the switchpoint on the original timeline.
Here, we have apparently done that. In one test, I got switchpoint =
0/51000510, state->EndRecPtr = 0/51000600. According to pg_waldump, on
timeline 1, we have this record at that LSN:

rmgr: Heap len (rec/tot): 54/ 54, tx: 2313637, lsn:
0/51000510, prev 0/510004D0, desc: DELETE xmax: 2313637, off: 3,
infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/6104 blk
0

And on timeline 2, we have this at that LSN:

rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
0/51000510, prev 0/510004D0, desc: CHECKPOINT_SHUTDOWN redo
0/51000510; tli 2; prev tli 1; fpw true; xid 0:2313638; oid 24576;
multi 1; offset 0; oldest xid 730 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown

It appears that pg_subscriber creates a recovery.conf that includes:

recovery_target_timeline = 'latest'
recovery_target_inclusive = true
recovery_target_lsn = '%X/%X'

...where %X/%X represents a valid LSN.

I think the problem here is that the WAL summarizer believes that when
a new timeline appears, it should pick up from where the old timeline
ended. And here, that doesn't happen: the new timeline branches off
before the end of the old timeline, because of the recovery target.

I'm not yet sure what should be done about this. The obvious answer is
"remove the assertion," and maybe that is all we need to do. However,
I'm not quite sure what the actual behavior will be if we just do
that, so I think more investigation is needed. I'll keep looking at
this, although given the US holiday I may not have results until next
week.

--
Robert Haas
EDB: http://www.enterprisedb.com

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Robert Haas (#3)

1 attachment(s)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Wed, Jul 3, 2024 at 1:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

I think the problem here is that the WAL summarizer believes that when
a new timeline appears, it should pick up from where the old timeline
ended. And here, that doesn't happen: the new timeline branches off
before the end of the old timeline, because of the recovery target.

I'm not yet sure what should be done about this. The obvious answer is
"remove the assertion," and maybe that is all we need to do. However,
I'm not quite sure what the actual behavior will be if we just do
that, so I think more investigation is needed. I'll keep looking at
this, although given the US holiday I may not have results until next
week.

Here is a draft patch that attempts to fix this problem. I'm not
certain that it's completely correct, but it does seem to fix the
reported issue.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Fix-WAL-summarization-across-abrupt-timeline-swit.patchapplication/octet-stream; name=v1-0001-Fix-WAL-summarization-across-abrupt-timeline-swit.patchDownload

From 26f9e45c45bad2aa63e00008ed0470f017da4203 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 10 Jul 2024 16:56:14 -0400
Subject: [PATCH v1] Fix WAL summarization across abrupt timeline switch.

The old code believed that it was not possible to switch timelines
without first replaying all of the WAL from the old timeline, but
that turns out to be false, as demonstrated by an example from Fujii
Masao. As a result, it assumed that summarization would always
continue from the LSN where summarization previously ended. But in
fact, when a timeline switch occurs without replaying all the WAL
from the previous timeline, we can need to back up to an earlier
LSN. Adjust accordingly.
---
 src/backend/postmaster/walsummarizer.c | 32 +++++++++++++++-----------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 55dc2ea8f5..da6f884d0c 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -387,13 +387,27 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/*
 		 * If we've reached the switch LSN, we can't summarize anything else
-		 * on this timeline. Switch to the next timeline and go around again.
+		 * on this timeline. Switch to the next timeline and go around again,
+		 * backing up to the exact switch point if we passed it.
 		 */
 		if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
 		{
+			/* Restart summarization from switch point. */
 			current_tli = switch_tli;
+			current_lsn = switch_lsn;
+
+			/* Next timeline and switch point, if any, not yet known. */
 			switch_lsn = InvalidXLogRecPtr;
 			switch_tli = 0;
+
+			/* Update (really, rewind, if needed) state in shared memory. */
+			LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+			WalSummarizerCtl->summarized_lsn = current_lsn;
+			WalSummarizerCtl->summarized_tli = current_tli;
+			WalSummarizerCtl->lsn_is_exact = true;
+			WalSummarizerCtl->pending_lsn = current_lsn;
+			LWLockRelease(WALSummarizerLock);
+
 			continue;
 		}
 
@@ -414,7 +428,6 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/* Update state in shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
 		WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
 		WalSummarizerCtl->summarized_tli = current_tli;
 		WalSummarizerCtl->lsn_is_exact = true;
@@ -1026,7 +1039,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 
 		/* Also update shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
 		Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
 		WalSummarizerCtl->pending_lsn = summary_end_lsn;
 		LWLockRelease(WALSummarizerLock);
@@ -1369,17 +1381,11 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 					 * Allow reads up to exactly the switch point.
 					 *
 					 * It's possible that this will cause read_upto to move
-					 * backwards, because walreceiver might have read a
-					 * partial record and flushed it to disk, and we'd view
-					 * that data as safe to read. However, the
-					 * XLOG_END_OF_RECOVERY record will be written at the end
-					 * of the last complete WAL record, not at the end of the
-					 * WAL that we've flushed to disk.
-					 *
-					 * So switchpoint < private->read_upto is possible here,
-					 * but switchpoint < state->EndRecPtr should not be.
+					 * backwards, because we might have been promoted before
+					 * reaching the end of the previous timeline. In that case,
+					 * the next loop iteration will likely conclude that we've
+					 * reached end of WAL.
 					 */
-					Assert(switchpoint >= state->EndRecPtr);
 					private_data->read_upto = switchpoint;
 
 					/* Debugging output. */
-- 
2.39.3 (Apple Git-145)

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Robert Haas (#4)

3 attachment(s)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Wed, Jul 10, 2024 at 5:01 PM Robert Haas <robertmhaas@gmail.com> wrote:

Here is a draft patch that attempts to fix this problem. I'm not
certain that it's completely correct, but it does seem to fix the
reported issue.

I tried to write a test case for this and discovered that there are
actually two separate problems in this area. First, as shown by the
assertion failure reported by Fujii Masao, the WAL summarizer thinks
that it should never need to back up to an earlier LSN, and the test
case he provided shows that this is incorrect. Second, the WAL
summarizer can end up in a bad state after the startup process renames
the last WAL file on the old timeline to a .partial file. If this
happens before the file has been summarized, then the WAL summarizer
can't access it any more and errors out. Promotion also removes WAL
files from the old timeline completely, but only if they're after the
switch point, and summarization doesn't care about those anyway. So
the partial file seems to be the only problem case.

In theory, the problem with the partial file could be handled in a
variety of ways: we could teach summarization to read the partial
file, perhaps, or postpone adding the .partial suffix until after
summarization has happened. But in practice, given where we are in the
release cycle, the only reasonable approach that I can see is to have
promotion wait for summarization to catch up, so that's what I did in
0003.

0002 is the same as what I posted previously as 0001, and teaches the
summarizer about backing up when we switch timelines. 0001 adds a
missing call to ConditionVariableCancelSleep; AFAIK, that omission has
no important consequences, but still seems like it should be fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v2-0002-Allow-WAL-summarization-to-back-up-when-timeline-.patchapplication/octet-stream; name=v2-0002-Allow-WAL-summarization-to-back-up-when-timeline-.patchDownload

From 017e5aefdc959e81207b423f54f73e54b6a420a9 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 10 Jul 2024 16:56:14 -0400
Subject: [PATCH v2 2/3] Allow WAL summarization to back up when timeline
 changes.

The old code believed that it was not possible to switch timelines
without first replaying all of the WAL from the old timeline, but
that turns out to be false, as demonstrated by an example from Fujii
Masao. As a result, it assumed that summarization would always
continue from the LSN where summarization previously ended. But in
fact, when a timeline switch occurs without replaying all the WAL
from the previous timeline, we can need to back up to an earlier
LSN. Adjust accordingly.
---
 src/backend/postmaster/walsummarizer.c | 32 +++++++++++++++-----------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 191b360bef..c1dd75b98c 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -388,13 +388,27 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/*
 		 * If we've reached the switch LSN, we can't summarize anything else
-		 * on this timeline. Switch to the next timeline and go around again.
+		 * on this timeline. Switch to the next timeline and go around again,
+		 * backing up to the exact switch point if we passed it.
 		 */
 		if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
 		{
+			/* Restart summarization from switch point. */
 			current_tli = switch_tli;
+			current_lsn = switch_lsn;
+
+			/* Next timeline and switch point, if any, not yet known. */
 			switch_lsn = InvalidXLogRecPtr;
 			switch_tli = 0;
+
+			/* Update (really, rewind, if needed) state in shared memory. */
+			LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+			WalSummarizerCtl->summarized_lsn = current_lsn;
+			WalSummarizerCtl->summarized_tli = current_tli;
+			WalSummarizerCtl->lsn_is_exact = true;
+			WalSummarizerCtl->pending_lsn = current_lsn;
+			LWLockRelease(WALSummarizerLock);
+
 			continue;
 		}
 
@@ -415,7 +429,6 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/* Update state in shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
 		WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
 		WalSummarizerCtl->summarized_tli = current_tli;
 		WalSummarizerCtl->lsn_is_exact = true;
@@ -1060,7 +1073,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 
 		/* Also update shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
 		Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
 		WalSummarizerCtl->pending_lsn = summary_end_lsn;
 		LWLockRelease(WALSummarizerLock);
@@ -1460,17 +1472,11 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 					 * Allow reads up to exactly the switch point.
 					 *
 					 * It's possible that this will cause read_upto to move
-					 * backwards, because walreceiver might have read a
-					 * partial record and flushed it to disk, and we'd view
-					 * that data as safe to read. However, the
-					 * XLOG_END_OF_RECOVERY record will be written at the end
-					 * of the last complete WAL record, not at the end of the
-					 * WAL that we've flushed to disk.
-					 *
-					 * So switchpoint < private->read_upto is possible here,
-					 * but switchpoint < state->EndRecPtr should not be.
+					 * backwards, because we might have been promoted before
+					 * reaching the end of the previous timeline. In that case,
+					 * the next loop iteration will likely conclude that we've
+					 * reached end of WAL.
 					 */
-					Assert(switchpoint >= state->EndRecPtr);
 					private_data->read_upto = switchpoint;
 
 					/* Debugging output. */
-- 
2.39.3 (Apple Git-145)

v2-0003-Wait-for-WAL-summarization-to-catch-up-before-cre.patchapplication/octet-stream; name=v2-0003-Wait-for-WAL-summarization-to-catch-up-before-cre.patchDownload

From fb27ff5f792d41dc518740b16a06fc1255446b5b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 17 Jul 2024 14:52:35 -0400
Subject: [PATCH v2 3/3] Wait for WAL summarization to catch up before creating
 .partial file.

When a standby is promoted, CleanupAfterArchiveRecovery() may decide
to rename the final WAL file from the old timeline by adding ".partial"
to the name. If WAL summarization is enabled and this file is renamed
before its partial contents are summarized, WAL summarization breaks:
the summarizer gets stuck at that point in the WAL stream and just
errors out.

To fix that, first make the startup process wait for WAL summarization
to catch up before renaming the file. Generally, this should be quick,
and if it's not, the user can shut off summarize_wal and try again.
To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.

In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.

Finally, add a test case that would have caught this bug and the
previously-fixed bug that WAL summarization sometimes needs to back up
when the timeline changes.
---
 src/backend/access/transam/xlog.c           |  33 +++++
 src/backend/backup/basebackup_incremental.c |  89 +-----------
 src/backend/postmaster/walsummarizer.c      | 142 ++++++++++++++++----
 src/bin/pg_combinebackup/meson.build        |   1 +
 src/bin/pg_combinebackup/t/008_promote.pl   |  81 +++++++++++
 src/include/access/xlog.h                   |   1 +
 src/include/postmaster/walsummarizer.h      |   3 +-
 7 files changed, 241 insertions(+), 109 deletions(-)
 create mode 100644 src/bin/pg_combinebackup/t/008_promote.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 636be5ca4d..e2679a11d7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -500,6 +500,11 @@ typedef struct XLogCtlData
 	 * If we create a new timeline when the system was started up,
 	 * PrevTimeLineID is the old timeline's ID that we forked off from.
 	 * Otherwise it's equal to InsertTimeLineID.
+	 *
+	 * We set these fields while holding info_lck. Most that reads these
+	 * values knows that recovery is no longer in progress and so can safely
+	 * read the value without a lock, but code that could be run either during
+	 * or after recovery can take info_lck while reading these values.
 	 */
 	TimeLineID	InsertTimeLineID;
 	TimeLineID	PrevTimeLineID;
@@ -5316,6 +5321,13 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 			char		partialfname[MAXFNAMELEN];
 			char		partialpath[MAXPGPATH];
 
+			/*
+			 * If we're summarizing WAL, we can't rename the partial file until
+			 * the summarizer finishes with it, else it will fail.
+			 */
+			if (summarize_wal)
+				WaitForWalSummarization(EndOfLog);
+
 			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
 			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
@@ -5946,8 +5958,10 @@ StartupXLOG(void)
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
+	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->InsertTimeLineID = newTLI;
 	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
 	 * Actually, if WAL ended in an incomplete record, skip the parts that
@@ -6482,6 +6496,25 @@ GetWALInsertionTimeLine(void)
 	return XLogCtl->InsertTimeLineID;
 }
 
+/*
+ * GetWALInsertionTimeLineIfSet -- If the system is not in recovery, returns
+ * the WAL insertion timeline; else, returns 0. Wherever possible, use
+ * GetWALInsertionTimeLine() instead, since it's cheaper. Note that this
+ * function decides recovery has ended as soon as the insert TLI is set, which
+ * happens before we set XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE.
+ */
+TimeLineID
+GetWALInsertionTimeLineIfSet(void)
+{
+	TimeLineID	insertTLI;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	insertTLI = XLogCtl->InsertTimeLineID;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return insertTLI;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 2108702397..59c1bd7ed8 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -278,11 +278,6 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 	XLogRecPtr	earliest_wal_range_start_lsn = InvalidXLogRecPtr;
 	TimeLineID	latest_wal_range_tli = 0;
 	XLogRecPtr	summarized_lsn;
-	XLogRecPtr	pending_lsn;
-	XLogRecPtr	prior_pending_lsn = InvalidXLogRecPtr;
-	int			deadcycles = 0;
-	TimestampTz initial_time,
-				current_time;
 
 	Assert(ib->buf.data == NULL);
 
@@ -457,85 +452,13 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 	}
 
 	/*
-	 * Wait for WAL summarization to catch up to the backup start LSN (but
-	 * time out if it doesn't do so quickly enough).
+	 * Wait for WAL summarization to catch up to the backup start LSN. This
+	 * will throw an error if the WAL summarizer appears to be stuck. If WAL
+	 * summarization gets disabled while we're waiting, this will return
+	 * immediately, and we'll error out further down if the WAL summaries are
+	 * incomplete.
 	 */
-	initial_time = current_time = GetCurrentTimestamp();
-	while (1)
-	{
-		long		timeout_in_ms = 10000;
-		long		elapsed_seconds;
-
-		/*
-		 * Align the wait time to prevent drift. This doesn't really matter,
-		 * but we'd like the warnings about how long we've been waiting to say
-		 * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
-		 * drifting to something that is not a multiple of ten.
-		 */
-		timeout_in_ms -=
-			TimestampDifferenceMilliseconds(initial_time, current_time) %
-			timeout_in_ms;
-
-		/* Wait for up to 10 seconds. */
-		summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
-												 timeout_in_ms, &pending_lsn);
-
-		/* If WAL summarization has progressed sufficiently, stop waiting. */
-		if (summarized_lsn >= backup_state->startpoint)
-			break;
-
-		/*
-		 * Keep track of the number of cycles during which there has been no
-		 * progression of pending_lsn. If pending_lsn is not advancing, that
-		 * means that not only are no new files appearing on disk, but we're
-		 * not even incorporating new records into the in-memory state.
-		 */
-		if (pending_lsn > prior_pending_lsn)
-		{
-			prior_pending_lsn = pending_lsn;
-			deadcycles = 0;
-		}
-		else
-			++deadcycles;
-
-		/*
-		 * If we've managed to wait for an entire minute without the WAL
-		 * summarizer absorbing a single WAL record, error out; probably
-		 * something is wrong.
-		 *
-		 * We could consider also erroring out if the summarizer is taking too
-		 * long to catch up, but it's not clear what rate of progress would be
-		 * acceptable and what would be too slow. So instead, we just try to
-		 * error out in the case where there's no progress at all. That seems
-		 * likely to catch a reasonable number of the things that can go wrong
-		 * in practice (e.g. the summarizer process is completely hung, say
-		 * because somebody hooked up a debugger to it or something) without
-		 * giving up too quickly when the system is just slow.
-		 */
-		if (deadcycles >= 6)
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("WAL summarization is not progressing"),
-					 errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
-							   LSN_FORMAT_ARGS(backup_state->startpoint),
-							   LSN_FORMAT_ARGS(summarized_lsn),
-							   LSN_FORMAT_ARGS(pending_lsn))));
-
-		/*
-		 * Otherwise, just let the user know what's happening.
-		 */
-		current_time = GetCurrentTimestamp();
-		elapsed_seconds =
-			TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
-		ereport(WARNING,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("still waiting for WAL summarization through %X/%X after %ld seconds",
-						LSN_FORMAT_ARGS(backup_state->startpoint),
-						elapsed_seconds),
-				 errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
-						   LSN_FORMAT_ARGS(summarized_lsn),
-						   LSN_FORMAT_ARGS(pending_lsn))));
-	}
+	WaitForWalSummarization(backup_state->startpoint);
 
 	/*
 	 * Retrieve a list of all WAL summaries on any timeline that overlap with
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index c1dd75b98c..508561e194 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -650,54 +650,132 @@ SetWalSummarizerLatch(void)
 }
 
 /*
- * Wait until WAL summarization reaches the given LSN, but not longer than
- * the given timeout.
+ * Wait until WAL summarization reaches the given LSN, but time out with an
+ * error if the summarizer seems to be stick.
  *
- * The return value is the first still-unsummarized LSN. If it's greater than
- * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
- *
- * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ * Returns immediately if summarize_wal is turned off while we wait. Caller
+ * is expected to handle this case, if necessary.
  */
-XLogRecPtr
-WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+void
+WaitForWalSummarization(XLogRecPtr lsn)
 {
-	TimestampTz start_time = GetCurrentTimestamp();
-	TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
-	XLogRecPtr	summarized_lsn;
+	TimestampTz initial_time,
+				cycle_time,
+				current_time;
+	XLogRecPtr	prior_pending_lsn = InvalidXLogRecPtr;
+	int			deadcycles = 0;
 
-	Assert(!XLogRecPtrIsInvalid(lsn));
-	Assert(timeout > 0);
+	initial_time = cycle_time = GetCurrentTimestamp();
 
 	while (1)
 	{
-		TimestampTz now;
-		long		remaining_timeout;
+		long		timeout_in_ms = 10000;
+		XLogRecPtr	summarized_lsn;
+		XLogRecPtr	pending_lsn;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* If WAL summarization is disabled while we're waiting, give up. */
+		if (!summarize_wal)
+			return;
 
 		/*
 		 * If the LSN summarized on disk has reached the target value, stop.
 		 */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
 		summarized_lsn = WalSummarizerCtl->summarized_lsn;
-		*pending_lsn = WalSummarizerCtl->pending_lsn;
+		pending_lsn = WalSummarizerCtl->pending_lsn;
 		LWLockRelease(WALSummarizerLock);
+
+		/* If WAL summarization has progressed sufficiently, stop waiting. */
 		if (summarized_lsn >= lsn)
 			break;
 
-		/* Timeout reached? If yes, stop. */
-		now = GetCurrentTimestamp();
-		remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
-		if (remaining_timeout <= 0)
-			break;
+		/* Recheck current time. */
+		current_time = GetCurrentTimestamp();
+
+		/* Have we finished the current cycle of waiting? */
+		if (TimestampDifferenceMilliseconds(cycle_time,
+											current_time) >= timeout_in_ms)
+		{
+			long		elapsed_seconds;
+
+			/* Begin new wait cycle. */
+			cycle_time = TimestampTzPlusMilliseconds(cycle_time,
+													 timeout_in_ms);
+
+			/*
+			 * Keep track of the number of cycles during which there has been
+			 * no progression of pending_lsn. If pending_lsn is not advancing,
+			 * that means that not only are no new files appearing on disk, but
+			 * we're not even incorporating new records into the in-memory
+			 * state.
+			 */
+			if (pending_lsn > prior_pending_lsn)
+			{
+				prior_pending_lsn = pending_lsn;
+				deadcycles = 0;
+			}
+			else
+				++deadcycles;
+
+			/*
+			 * If we've managed to wait for an entire minute without the WAL
+			 * summarizer absorbing a single WAL record, error out; probably
+			 * something is wrong.
+			 *
+			 * We could consider also erroring out if the summarizer is taking
+			 * too long to catch up, but it's not clear what rate of progress
+			 * would be acceptable and what would be too slow. So instead, we
+			 * just try to error out in the case where there's no progress at
+			 * all. That seems likely to catch a reasonable number of the
+			 * things that can go wrong in practice (e.g. the summarizer
+			 * process is completely hung, say because somebody hooked up a
+			 * debugger to it or something) without giving up too quickly when
+			 * the system is just slow.
+			 */
+			if (deadcycles >= 6)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("WAL summarization is not progressing"),
+						 errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+								   LSN_FORMAT_ARGS(lsn),
+								   LSN_FORMAT_ARGS(summarized_lsn),
+								   LSN_FORMAT_ARGS(pending_lsn))));
+
+
+			/*
+			 * Otherwise, just let the user know what's happening.
+			 */
+			elapsed_seconds =
+				TimestampDifferenceMilliseconds(initial_time,
+												current_time) / 1000;
+			ereport(WARNING,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("still waiting for WAL summarization through %X/%X after %ld seconds",
+							LSN_FORMAT_ARGS(lsn),
+							elapsed_seconds),
+					 errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+							   LSN_FORMAT_ARGS(summarized_lsn),
+							   LSN_FORMAT_ARGS(pending_lsn))));
+		}
+
+		/*
+		 * Align the wait time to prevent drift. This doesn't really matter,
+		 * but we'd like the warnings about how long we've been waiting to say
+		 * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+		 * drifting to something that is not a multiple of ten.
+		 */
+		timeout_in_ms -=
+			TimestampDifferenceMilliseconds(cycle_time, current_time);
 
 		/* Wait and see. */
 		ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
-									remaining_timeout,
+									timeout_in_ms,
 									WAIT_EVENT_WAL_SUMMARY_READY);
 	}
 
 	ConditionVariableCancelSleep();
-
-	return summarized_lsn;
 }
 
 /*
@@ -730,6 +808,22 @@ GetLatestLSN(TimeLineID *tli)
 		TimeLineID	flush_tli;
 		XLogRecPtr	replay_lsn;
 		TimeLineID	replay_tli;
+		TimeLineID	insert_tli;
+
+		/*
+		 * After the insert TLI has been set and before the control file has
+		 * been updated to show the DB in production, RecoveryInProgress()
+		 * will return true, because it's not yet safe for all backends to
+		 * begin writing WAL. However, replay has already ceased, so from our
+		 * point of view, recovery is already over. We should summarize up to
+		 * where replay stopped and then prepare to resume at the start of the
+		 * insert timeline.
+		 */
+		if ((insert_tli = GetWALInsertionTimeLineIfSet()) != 0)
+		{
+			*tli = insert_tli;
+			return GetXLogReplayRecPtr(NULL);
+		}
 
 		/*
 		 * What we really want to know is how much WAL has been flushed to
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index d871b2e3b8..d142608e94 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -35,6 +35,7 @@ tests += {
       't/005_integrity.pl',
       't/006_db_file_copy.pl',
       't/007_wal_level_minimal.pl',
+      't/008_promote.pl',
     ],
   }
 }
diff --git a/src/bin/pg_combinebackup/t/008_promote.pl b/src/bin/pg_combinebackup/t/008_promote.pl
new file mode 100644
index 0000000000..1154a5d8b2
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/008_promote.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+#
+# Test whether WAL summaries are complete such that incremental backup
+# can be performed after promoting a standby at an arbitrary LSN.
+
+use strict;
+use warnings FATAL => 'all';
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_COMBINEBACKUP_MODE} || '--copy';
+
+note "testing using mode $mode";
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->append_conf('postgresql.conf', 'log_min_messages = debug1');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'avocado');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+	[ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+	"full backup from node1");
+
+# Checkpoint and record LSN after.
+$node1->safe_psql('postgres', 'CHECKPOINT');
+my $lsn = $node1->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now create a second node. We want this to stream from the first node and
+# then stop recovery at some arbitrary LSN, not just when it hits the end of
+# WAL, so use a recovery target.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup1', 'has_streaming' => 1);
+$node2->append_conf('postgresql.conf', <<EOM);
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'pause'
+EOM
+$node2->start();
+
+# Wait until recoveery pauses, then promote.
+$node2->poll_query_until('postgres', "SELECT pg_get_wal_replay_pause_state() = 'paused';");
+$node2->safe_psql('postgres', "SELECT pg_promote()");
+
+# Once promotion occurs, insert a second row on the new node.
+$node2->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';");
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'blackberry');
+EOM
+
+# Now take an incremental backup. If WAL summarization didn't follow the
+# timeline cange correctly, something should break at this point.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node2->command_ok(
+	[ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+	  '--incremental', $backup1path . '/backup_manifest' ],
+	"incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup2',
+						 combine_with_prior => [ 'backup1' ]);
+$node3->start();
+
+done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943..2c507ea618 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -248,6 +248,7 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
+extern TimeLineID GetWALInsertionTimeLineIfSet(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
index 112bc1e6cb..aedca55676 100644
--- a/src/include/postmaster/walsummarizer.h
+++ b/src/include/postmaster/walsummarizer.h
@@ -30,7 +30,6 @@ extern void GetWalSummarizerState(TimeLineID *summarized_tli,
 extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
 										   bool *lsn_is_exact);
 extern void SetWalSummarizerLatch(void);
-extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
-										  XLogRecPtr *pending_lsn);
+extern void WaitForWalSummarization(XLogRecPtr lsn);
 
 #endif
-- 
2.39.3 (Apple Git-145)

v2-0001-Add-missing-call-to-ConditionVariableCancelSleep.patchapplication/octet-stream; name=v2-0001-Add-missing-call-to-ConditionVariableCancelSleep.patchDownload

From 3a5304d30e299dc836854a98597f4a09267894a1 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 17 Jul 2024 14:53:00 -0400
Subject: [PATCH v2 1/3] Add missing call to ConditionVariableCancelSleep().

After calling ConditionVariableSleep() or ConditionVariableTimedSleep()
one or more times, code is supposed to call ConditionVariableCancelSleep()
to remove itself from the waitlist. This code neglected to do so.
As far as I know, that had no observable consequences, but let's make
the code correct.
---
 src/backend/postmaster/walsummarizer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 83c178e766..191b360bef 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -682,6 +682,8 @@ WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
 									WAIT_EVENT_WAL_SUMMARY_READY);
 	}
 
+	ConditionVariableCancelSleep();
+
 	return summarized_lsn;
 }
 
-- 
2.39.3 (Apple Git-145)

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Robert Haas (#5)

2 attachment(s)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

I went ahead and committed 0001, which is pretty trivial. Hopefully,
at least, it's trivial enough that I didn't mess it up.

Here's a rebase of 0002 and 0003, now 0001 and 0002, with one minor
fix to hopefully avoid annoying CI.

I'm still hoping for some review/feedback/testing of these before I
commit them, but I also don't want to wait too long.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v3-0002-Wait-for-WAL-summarization-to-catch-up-before-cre.patchapplication/octet-stream; name=v3-0002-Wait-for-WAL-summarization-to-catch-up-before-cre.patchDownload

From 64e4e1245ea6d4c02d98da9f5a9ed4b36a26eb7f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 22 Jul 2024 10:20:43 -0400
Subject: [PATCH v3 2/2] Wait for WAL summarization to catch up before creating
 .partial file.

When a standby is promoted, CleanupAfterArchiveRecovery() may decide
to rename the final WAL file from the old timeline by adding ".partial"
to the name. If WAL summarization is enabled and this file is renamed
before its partial contents are summarized, WAL summarization breaks:
the summarizer gets stuck at that point in the WAL stream and just
errors out.

To fix that, first make the startup process wait for WAL summarization
to catch up before renaming the file. Generally, this should be quick,
and if it's not, the user can shut off summarize_wal and try again.
To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.

In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.

Finally, add a test case that would have caught this bug and the
previously-fixed bug that WAL summarization sometimes needs to back up
when the timeline changes.
---
 src/backend/access/transam/xlog.c           |  33 +++++
 src/backend/backup/basebackup_incremental.c |  90 +------------
 src/backend/postmaster/walsummarizer.c      | 142 ++++++++++++++++----
 src/bin/pg_combinebackup/meson.build        |   1 +
 src/bin/pg_combinebackup/t/008_promote.pl   |  81 +++++++++++
 src/include/access/xlog.h                   |   1 +
 src/include/postmaster/walsummarizer.h      |   3 +-
 7 files changed, 241 insertions(+), 110 deletions(-)
 create mode 100644 src/bin/pg_combinebackup/t/008_promote.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 636be5ca4d..e2679a11d7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -500,6 +500,11 @@ typedef struct XLogCtlData
 	 * If we create a new timeline when the system was started up,
 	 * PrevTimeLineID is the old timeline's ID that we forked off from.
 	 * Otherwise it's equal to InsertTimeLineID.
+	 *
+	 * We set these fields while holding info_lck. Most that reads these
+	 * values knows that recovery is no longer in progress and so can safely
+	 * read the value without a lock, but code that could be run either during
+	 * or after recovery can take info_lck while reading these values.
 	 */
 	TimeLineID	InsertTimeLineID;
 	TimeLineID	PrevTimeLineID;
@@ -5316,6 +5321,13 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 			char		partialfname[MAXFNAMELEN];
 			char		partialpath[MAXPGPATH];
 
+			/*
+			 * If we're summarizing WAL, we can't rename the partial file until
+			 * the summarizer finishes with it, else it will fail.
+			 */
+			if (summarize_wal)
+				WaitForWalSummarization(EndOfLog);
+
 			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
 			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
@@ -5946,8 +5958,10 @@ StartupXLOG(void)
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
+	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->InsertTimeLineID = newTLI;
 	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
 	 * Actually, if WAL ended in an incomplete record, skip the parts that
@@ -6482,6 +6496,25 @@ GetWALInsertionTimeLine(void)
 	return XLogCtl->InsertTimeLineID;
 }
 
+/*
+ * GetWALInsertionTimeLineIfSet -- If the system is not in recovery, returns
+ * the WAL insertion timeline; else, returns 0. Wherever possible, use
+ * GetWALInsertionTimeLine() instead, since it's cheaper. Note that this
+ * function decides recovery has ended as soon as the insert TLI is set, which
+ * happens before we set XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE.
+ */
+TimeLineID
+GetWALInsertionTimeLineIfSet(void)
+{
+	TimeLineID	insertTLI;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	insertTLI = XLogCtl->InsertTimeLineID;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return insertTLI;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 2108702397..6cfce55365 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -277,12 +277,6 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 	TimeLineID	earliest_wal_range_tli = 0;
 	XLogRecPtr	earliest_wal_range_start_lsn = InvalidXLogRecPtr;
 	TimeLineID	latest_wal_range_tli = 0;
-	XLogRecPtr	summarized_lsn;
-	XLogRecPtr	pending_lsn;
-	XLogRecPtr	prior_pending_lsn = InvalidXLogRecPtr;
-	int			deadcycles = 0;
-	TimestampTz initial_time,
-				current_time;
 
 	Assert(ib->buf.data == NULL);
 
@@ -457,85 +451,13 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 	}
 
 	/*
-	 * Wait for WAL summarization to catch up to the backup start LSN (but
-	 * time out if it doesn't do so quickly enough).
+	 * Wait for WAL summarization to catch up to the backup start LSN. This
+	 * will throw an error if the WAL summarizer appears to be stuck. If WAL
+	 * summarization gets disabled while we're waiting, this will return
+	 * immediately, and we'll error out further down if the WAL summaries are
+	 * incomplete.
 	 */
-	initial_time = current_time = GetCurrentTimestamp();
-	while (1)
-	{
-		long		timeout_in_ms = 10000;
-		long		elapsed_seconds;
-
-		/*
-		 * Align the wait time to prevent drift. This doesn't really matter,
-		 * but we'd like the warnings about how long we've been waiting to say
-		 * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
-		 * drifting to something that is not a multiple of ten.
-		 */
-		timeout_in_ms -=
-			TimestampDifferenceMilliseconds(initial_time, current_time) %
-			timeout_in_ms;
-
-		/* Wait for up to 10 seconds. */
-		summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
-												 timeout_in_ms, &pending_lsn);
-
-		/* If WAL summarization has progressed sufficiently, stop waiting. */
-		if (summarized_lsn >= backup_state->startpoint)
-			break;
-
-		/*
-		 * Keep track of the number of cycles during which there has been no
-		 * progression of pending_lsn. If pending_lsn is not advancing, that
-		 * means that not only are no new files appearing on disk, but we're
-		 * not even incorporating new records into the in-memory state.
-		 */
-		if (pending_lsn > prior_pending_lsn)
-		{
-			prior_pending_lsn = pending_lsn;
-			deadcycles = 0;
-		}
-		else
-			++deadcycles;
-
-		/*
-		 * If we've managed to wait for an entire minute without the WAL
-		 * summarizer absorbing a single WAL record, error out; probably
-		 * something is wrong.
-		 *
-		 * We could consider also erroring out if the summarizer is taking too
-		 * long to catch up, but it's not clear what rate of progress would be
-		 * acceptable and what would be too slow. So instead, we just try to
-		 * error out in the case where there's no progress at all. That seems
-		 * likely to catch a reasonable number of the things that can go wrong
-		 * in practice (e.g. the summarizer process is completely hung, say
-		 * because somebody hooked up a debugger to it or something) without
-		 * giving up too quickly when the system is just slow.
-		 */
-		if (deadcycles >= 6)
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("WAL summarization is not progressing"),
-					 errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
-							   LSN_FORMAT_ARGS(backup_state->startpoint),
-							   LSN_FORMAT_ARGS(summarized_lsn),
-							   LSN_FORMAT_ARGS(pending_lsn))));
-
-		/*
-		 * Otherwise, just let the user know what's happening.
-		 */
-		current_time = GetCurrentTimestamp();
-		elapsed_seconds =
-			TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
-		ereport(WARNING,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("still waiting for WAL summarization through %X/%X after %ld seconds",
-						LSN_FORMAT_ARGS(backup_state->startpoint),
-						elapsed_seconds),
-				 errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
-						   LSN_FORMAT_ARGS(summarized_lsn),
-						   LSN_FORMAT_ARGS(pending_lsn))));
-	}
+	WaitForWalSummarization(backup_state->startpoint);
 
 	/*
 	 * Retrieve a list of all WAL summaries on any timeline that overlap with
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index c1dd75b98c..508561e194 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -650,54 +650,132 @@ SetWalSummarizerLatch(void)
 }
 
 /*
- * Wait until WAL summarization reaches the given LSN, but not longer than
- * the given timeout.
+ * Wait until WAL summarization reaches the given LSN, but time out with an
+ * error if the summarizer seems to be stick.
  *
- * The return value is the first still-unsummarized LSN. If it's greater than
- * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
- *
- * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ * Returns immediately if summarize_wal is turned off while we wait. Caller
+ * is expected to handle this case, if necessary.
  */
-XLogRecPtr
-WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+void
+WaitForWalSummarization(XLogRecPtr lsn)
 {
-	TimestampTz start_time = GetCurrentTimestamp();
-	TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
-	XLogRecPtr	summarized_lsn;
+	TimestampTz initial_time,
+				cycle_time,
+				current_time;
+	XLogRecPtr	prior_pending_lsn = InvalidXLogRecPtr;
+	int			deadcycles = 0;
 
-	Assert(!XLogRecPtrIsInvalid(lsn));
-	Assert(timeout > 0);
+	initial_time = cycle_time = GetCurrentTimestamp();
 
 	while (1)
 	{
-		TimestampTz now;
-		long		remaining_timeout;
+		long		timeout_in_ms = 10000;
+		XLogRecPtr	summarized_lsn;
+		XLogRecPtr	pending_lsn;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* If WAL summarization is disabled while we're waiting, give up. */
+		if (!summarize_wal)
+			return;
 
 		/*
 		 * If the LSN summarized on disk has reached the target value, stop.
 		 */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
 		summarized_lsn = WalSummarizerCtl->summarized_lsn;
-		*pending_lsn = WalSummarizerCtl->pending_lsn;
+		pending_lsn = WalSummarizerCtl->pending_lsn;
 		LWLockRelease(WALSummarizerLock);
+
+		/* If WAL summarization has progressed sufficiently, stop waiting. */
 		if (summarized_lsn >= lsn)
 			break;
 
-		/* Timeout reached? If yes, stop. */
-		now = GetCurrentTimestamp();
-		remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
-		if (remaining_timeout <= 0)
-			break;
+		/* Recheck current time. */
+		current_time = GetCurrentTimestamp();
+
+		/* Have we finished the current cycle of waiting? */
+		if (TimestampDifferenceMilliseconds(cycle_time,
+											current_time) >= timeout_in_ms)
+		{
+			long		elapsed_seconds;
+
+			/* Begin new wait cycle. */
+			cycle_time = TimestampTzPlusMilliseconds(cycle_time,
+													 timeout_in_ms);
+
+			/*
+			 * Keep track of the number of cycles during which there has been
+			 * no progression of pending_lsn. If pending_lsn is not advancing,
+			 * that means that not only are no new files appearing on disk, but
+			 * we're not even incorporating new records into the in-memory
+			 * state.
+			 */
+			if (pending_lsn > prior_pending_lsn)
+			{
+				prior_pending_lsn = pending_lsn;
+				deadcycles = 0;
+			}
+			else
+				++deadcycles;
+
+			/*
+			 * If we've managed to wait for an entire minute without the WAL
+			 * summarizer absorbing a single WAL record, error out; probably
+			 * something is wrong.
+			 *
+			 * We could consider also erroring out if the summarizer is taking
+			 * too long to catch up, but it's not clear what rate of progress
+			 * would be acceptable and what would be too slow. So instead, we
+			 * just try to error out in the case where there's no progress at
+			 * all. That seems likely to catch a reasonable number of the
+			 * things that can go wrong in practice (e.g. the summarizer
+			 * process is completely hung, say because somebody hooked up a
+			 * debugger to it or something) without giving up too quickly when
+			 * the system is just slow.
+			 */
+			if (deadcycles >= 6)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("WAL summarization is not progressing"),
+						 errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+								   LSN_FORMAT_ARGS(lsn),
+								   LSN_FORMAT_ARGS(summarized_lsn),
+								   LSN_FORMAT_ARGS(pending_lsn))));
+
+
+			/*
+			 * Otherwise, just let the user know what's happening.
+			 */
+			elapsed_seconds =
+				TimestampDifferenceMilliseconds(initial_time,
+												current_time) / 1000;
+			ereport(WARNING,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("still waiting for WAL summarization through %X/%X after %ld seconds",
+							LSN_FORMAT_ARGS(lsn),
+							elapsed_seconds),
+					 errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+							   LSN_FORMAT_ARGS(summarized_lsn),
+							   LSN_FORMAT_ARGS(pending_lsn))));
+		}
+
+		/*
+		 * Align the wait time to prevent drift. This doesn't really matter,
+		 * but we'd like the warnings about how long we've been waiting to say
+		 * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+		 * drifting to something that is not a multiple of ten.
+		 */
+		timeout_in_ms -=
+			TimestampDifferenceMilliseconds(cycle_time, current_time);
 
 		/* Wait and see. */
 		ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
-									remaining_timeout,
+									timeout_in_ms,
 									WAIT_EVENT_WAL_SUMMARY_READY);
 	}
 
 	ConditionVariableCancelSleep();
-
-	return summarized_lsn;
 }
 
 /*
@@ -730,6 +808,22 @@ GetLatestLSN(TimeLineID *tli)
 		TimeLineID	flush_tli;
 		XLogRecPtr	replay_lsn;
 		TimeLineID	replay_tli;
+		TimeLineID	insert_tli;
+
+		/*
+		 * After the insert TLI has been set and before the control file has
+		 * been updated to show the DB in production, RecoveryInProgress()
+		 * will return true, because it's not yet safe for all backends to
+		 * begin writing WAL. However, replay has already ceased, so from our
+		 * point of view, recovery is already over. We should summarize up to
+		 * where replay stopped and then prepare to resume at the start of the
+		 * insert timeline.
+		 */
+		if ((insert_tli = GetWALInsertionTimeLineIfSet()) != 0)
+		{
+			*tli = insert_tli;
+			return GetXLogReplayRecPtr(NULL);
+		}
 
 		/*
 		 * What we really want to know is how much WAL has been flushed to
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index d871b2e3b8..d142608e94 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -35,6 +35,7 @@ tests += {
       't/005_integrity.pl',
       't/006_db_file_copy.pl',
       't/007_wal_level_minimal.pl',
+      't/008_promote.pl',
     ],
   }
 }
diff --git a/src/bin/pg_combinebackup/t/008_promote.pl b/src/bin/pg_combinebackup/t/008_promote.pl
new file mode 100644
index 0000000000..1154a5d8b2
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/008_promote.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+#
+# Test whether WAL summaries are complete such that incremental backup
+# can be performed after promoting a standby at an arbitrary LSN.
+
+use strict;
+use warnings FATAL => 'all';
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_COMBINEBACKUP_MODE} || '--copy';
+
+note "testing using mode $mode";
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->append_conf('postgresql.conf', 'log_min_messages = debug1');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'avocado');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+	[ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+	"full backup from node1");
+
+# Checkpoint and record LSN after.
+$node1->safe_psql('postgres', 'CHECKPOINT');
+my $lsn = $node1->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now create a second node. We want this to stream from the first node and
+# then stop recovery at some arbitrary LSN, not just when it hits the end of
+# WAL, so use a recovery target.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup1', 'has_streaming' => 1);
+$node2->append_conf('postgresql.conf', <<EOM);
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'pause'
+EOM
+$node2->start();
+
+# Wait until recoveery pauses, then promote.
+$node2->poll_query_until('postgres', "SELECT pg_get_wal_replay_pause_state() = 'paused';");
+$node2->safe_psql('postgres', "SELECT pg_promote()");
+
+# Once promotion occurs, insert a second row on the new node.
+$node2->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';");
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'blackberry');
+EOM
+
+# Now take an incremental backup. If WAL summarization didn't follow the
+# timeline cange correctly, something should break at this point.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node2->command_ok(
+	[ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+	  '--incremental', $backup1path . '/backup_manifest' ],
+	"incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup2',
+						 combine_with_prior => [ 'backup1' ]);
+$node3->start();
+
+done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943..2c507ea618 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -248,6 +248,7 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
+extern TimeLineID GetWALInsertionTimeLineIfSet(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
index 112bc1e6cb..aedca55676 100644
--- a/src/include/postmaster/walsummarizer.h
+++ b/src/include/postmaster/walsummarizer.h
@@ -30,7 +30,6 @@ extern void GetWalSummarizerState(TimeLineID *summarized_tli,
 extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
 										   bool *lsn_is_exact);
 extern void SetWalSummarizerLatch(void);
-extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
-										  XLogRecPtr *pending_lsn);
+extern void WaitForWalSummarization(XLogRecPtr lsn);
 
 #endif
-- 
2.39.3 (Apple Git-145)

v3-0001-Allow-WAL-summarization-to-back-up-when-timeline-.patchapplication/octet-stream; name=v3-0001-Allow-WAL-summarization-to-back-up-when-timeline-.patchDownload

From c0870a8cf47044ed67a9a90933ee1077006ec3a0 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 10 Jul 2024 16:56:14 -0400
Subject: [PATCH v3 1/2] Allow WAL summarization to back up when timeline
 changes.

The old code believed that it was not possible to switch timelines
without first replaying all of the WAL from the old timeline, but
that turns out to be false, as demonstrated by an example from Fujii
Masao. As a result, it assumed that summarization would always
continue from the LSN where summarization previously ended. But in
fact, when a timeline switch occurs without replaying all the WAL
from the previous timeline, we can need to back up to an earlier
LSN. Adjust accordingly.
---
 src/backend/postmaster/walsummarizer.c | 32 +++++++++++++++-----------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 191b360bef..c1dd75b98c 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -388,13 +388,27 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/*
 		 * If we've reached the switch LSN, we can't summarize anything else
-		 * on this timeline. Switch to the next timeline and go around again.
+		 * on this timeline. Switch to the next timeline and go around again,
+		 * backing up to the exact switch point if we passed it.
 		 */
 		if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
 		{
+			/* Restart summarization from switch point. */
 			current_tli = switch_tli;
+			current_lsn = switch_lsn;
+
+			/* Next timeline and switch point, if any, not yet known. */
 			switch_lsn = InvalidXLogRecPtr;
 			switch_tli = 0;
+
+			/* Update (really, rewind, if needed) state in shared memory. */
+			LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+			WalSummarizerCtl->summarized_lsn = current_lsn;
+			WalSummarizerCtl->summarized_tli = current_tli;
+			WalSummarizerCtl->lsn_is_exact = true;
+			WalSummarizerCtl->pending_lsn = current_lsn;
+			LWLockRelease(WALSummarizerLock);
+
 			continue;
 		}
 
@@ -415,7 +429,6 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 
 		/* Update state in shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
 		WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
 		WalSummarizerCtl->summarized_tli = current_tli;
 		WalSummarizerCtl->lsn_is_exact = true;
@@ -1060,7 +1073,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 
 		/* Also update shared memory. */
 		LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
-		Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
 		Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
 		WalSummarizerCtl->pending_lsn = summary_end_lsn;
 		LWLockRelease(WALSummarizerLock);
@@ -1460,17 +1472,11 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 					 * Allow reads up to exactly the switch point.
 					 *
 					 * It's possible that this will cause read_upto to move
-					 * backwards, because walreceiver might have read a
-					 * partial record and flushed it to disk, and we'd view
-					 * that data as safe to read. However, the
-					 * XLOG_END_OF_RECOVERY record will be written at the end
-					 * of the last complete WAL record, not at the end of the
-					 * WAL that we've flushed to disk.
-					 *
-					 * So switchpoint < private->read_upto is possible here,
-					 * but switchpoint < state->EndRecPtr should not be.
+					 * backwards, because we might have been promoted before
+					 * reaching the end of the previous timeline. In that case,
+					 * the next loop iteration will likely conclude that we've
+					 * reached end of WAL.
 					 */
-					Assert(switchpoint >= state->EndRecPtr);
 					private_data->read_upto = switchpoint;
 
 					/* Debugging output. */
-- 
2.39.3 (Apple Git-145)

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Robert Haas (#6)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Mon, Jul 22, 2024 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote:

I'm still hoping for some review/feedback/testing of these before I
commit them, but I also don't want to wait too long.

Hearing nothing, pushed 0001.

--
Robert Haas
EDB: http://www.enterprisedb.com

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Robert Haas (#7)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Fri, Jul 26, 2024 at 10:09:22AM -0400, Robert Haas wrote:

On Mon, Jul 22, 2024 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote:

I'm still hoping for some review/feedback/testing of these before I
commit them, but I also don't want to wait too long.

Hearing nothing, pushed 0001.

nitpick: I think this one needs a pgindent.

--
nathan

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Nathan Bossart (#8)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Fri, Jul 26, 2024 at 11:51 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

nitpick: I think this one needs a pgindent.

Ugh, sorry. I thought of that while I was working on the commit but
then I messed up some other aspect of it and this went out of my head.

Fixed now, I hope.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10

Alexander Korotkov

aekorotkov@gmail.com

over 1 year ago

In reply to: Robert Haas (#9)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Fri, Jul 26, 2024 at 7:02 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 26, 2024 at 11:51 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

nitpick: I think this one needs a pgindent.

Ugh, sorry. I thought of that while I was working on the commit but
then I messed up some other aspect of it and this went out of my head.

Fixed now, I hope.

0002 could also use pg_indent and pgperltidy. And I have couple other
notes regarding 0002.

In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.

It would be nice to split refactoring out of material logic changed.
This way it would be easier to review and would look cleaner in the
git history.

To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.

I think this is a pretty long sentence, and I'm not sure I can
understand it correctly. Does startup process wait for the WAL
summarizer without this patch? If not, it's not clear for me that
word "previously" doesn't distribute to this part of sentence.
Breaking this into multiple sentences could improve the clarity for
me.

------
Regards,
Alexander Korotkov
Supabase

#11

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Alexander Korotkov (#10)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

0002 could also use pg_indent and pgperltidy. And I have couple other
notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.

It would be nice to split refactoring out of material logic changed.
This way it would be easier to review and would look cleaner in the
git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.

I think this is a pretty long sentence, and I'm not sure I can
understand it correctly. Does startup process wait for the WAL
summarizer without this patch? If not, it's not clear for me that
word "previously" doesn't distribute to this part of sentence.
Breaking this into multiple sentences could improve the clarity for
me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12

Alexander Korotkov

aekorotkov@gmail.com

over 1 year ago

In reply to: Robert Haas (#11)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Mon, Jul 29, 2024 at 7:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

0002 could also use pg_indent and pgperltidy. And I have couple other
notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

Got it, thanks.

In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.

It would be nice to split refactoring out of material logic changed.
This way it would be easier to review and would look cleaner in the
git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

I understand your point. I'm also not huge fan of a flood of small
commits. Nevertheless, I find splitting refactoring from other
changes generally useful. That could be a single commit of many small
refactorings, not many small commits. The point for me is easier
review: you can expect refactoring commit to contain "isomorphic"
changes, while other commits implementing material logic changes. But
that might be a committer preference though.

To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.

I think this is a pretty long sentence, and I'm not sure I can
understand it correctly. Does startup process wait for the WAL
summarizer without this patch? If not, it's not clear for me that
word "previously" doesn't distribute to this part of sentence.
Breaking this into multiple sentences could improve the clarity for
me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

Great, thank you for the explanation. Now that's clear.

------
Regards,
Alexander Korotkov
Supabase

#13

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Alexander Korotkov (#12)

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Wed, Jul 31, 2024 at 04:49:54PM +0300, Alexander Korotkov wrote:

On Mon, Jul 29, 2024 at 7:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

I understand your point. I'm also not huge fan of a flood of small
commits. Nevertheless, I find splitting refactoring from other
changes generally useful. That could be a single commit of many small
refactorings, not many small commits. The point for me is easier
review: you can expect refactoring commit to contain "isomorphic"
changes, while other commits implementing material logic changes.

For review, it also tends to matter a lot to me, especially if the
same areas of code are changed across multiple commits. That's more
annoying for authors as the splits are annoying to maintain. For a
single caller introduced, what Robert has done is fine IMO.

But that might be a committer preference though.

I tend to prefer refactorings if it comes to a cleaner git history,
still that's always case-by-case, and all of us have our own habits.
--
Michael