Apparent bug in WAL summarizer process (hung state)

Started by Israel Barth Rubioover 1 year ago7 messages
#1Israel Barth Rubio
barthisrael@gmail.com

Hello,

Hope you are doing well.

I've been playing a bit with the incremental backup feature which might
come as
part of the 17 release, and I think I hit a possible bug in the WAL
summarizer
process.

The issue that I face refers to the summarizer process getting into a hung
state.
When the issue is triggered, it keeps in an infinite loop trying to process
a WAL
file that no longer exists. It apparently comes up only when I perform
changes to
`wal_summarize` GUC and reload Postgres, while there is some load in
Postgres
which makes it recycle WAL files.

I'm running Postgres 17 in a Rockylinux 9 VM. In order to have less WAL
files
available in `pg_wal` and make it easier to reproduce the issue, I'm using
a low
value for `max_wal_size` ('100MB'). You can find below the steps that I
took to
reproduce this problem, assuming this small `max_wal_size`, and
`summarize_wal`
initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental
full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Re-enable sumarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental
full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps
manually. It's harder to reproduce if I attempt to run those commands as a
bash script.

This is the sample output of a run of those commands:

```console

(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS
SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
full_backup_1NOTICE: WAL archiving is not enabled; you must ensure
that all required WAL segments are copied through other means to
complete the backup331785/331785 kB (100%), 1/1 tablespace(barman)
[postgres@barmandevhost ~]$ psql -c "DROP TABLE test"DROP
TABLE(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test
AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
incremental_backup_1 --incremental
full_backup_1/backup_manifestNOTICE: WAL archiving is not enabled;
you must ensure that all required WAL segments are copied through
other means to complete the backup111263/331720 kB (33%), 1/1
tablespace(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM
SET summarize_wal TO off"ALTER SYSTEM(barman) [postgres@barmandevhost
~]$ psql -c "SELECT pg_reload_conf()" pg_reload_conf----------------
t(1 row)
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"DROP
TABLE(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test
AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO
on"ALTER SYSTEM(barman) [postgres@barmandevhost ~]$ psql -c "SELECT
pg_reload_conf()" pg_reload_conf---------------- t(1 row)
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P
-D full_backup_2NOTICE: WAL archiving is not enabled; you must ensure
that all required WAL segments are copied through other means to
complete the backup331734/331734 kB (100%), 1/1 tablespace(barman)
[postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
incremental_backup_2 --incremental
full_backup_2/backup_manifestWARNING: still waiting for WAL
summarization through 2/C1000028 after 10 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 20 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting
for WAL summarization through 2/C1000028 after 30 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 40 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting
for WAL summarization through 2/C1000028 after 50 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 60 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: aborting backup
due to backend exiting before pg_backup_stop was calledpg_basebackup:
error: could not initiate base backup: ERROR: WAL summarization is
not progressingDETAIL: Summarization is needed through 2/C1000028,
but is stuck at 2/B30000D8 on disk and 2/B30000D8 in
memory.pg_basebackup: removing data directory "incremental_backup_2"

```

I took an `ls` output from `pg_wal` as well as `strace` and `gdb` from the
WAL
summarizer process. I'm attaching that to this email hoping that can help
somehow.

FWIW, once I restart Postgres the WAL summarizer process gets back to normal
functioning. It seems to me there is some race condition between when a WAL
file
is removed and when `summarize_wal` is re-enabled, causing the process to
keep
looking for a WAL file that is the past.

Best regards,
Israel.

#2Israel Barth Rubio
barthisrael@gmail.com
In reply to: Israel Barth Rubio (#1)
3 attachment(s)
Re: Apparent bug in WAL summarizer process (hung state)

I'm attaching the files which I missed in the original email.

Show quoted text

Attachments:

strace.txttext/plain; charset=US-ASCII; name=strace.txtDownload
gdb.txttext/plain; charset=US-ASCII; name=gdb.txtDownload
ls.txttext/plain; charset=US-ASCII; name=ls.txtDownload
#3Michael Paquier
michael@paquier.xyz
In reply to: Israel Barth Rubio (#1)
Re: Apparent bug in WAL summarizer process (hung state)

On Mon, Jun 24, 2024 at 02:56:00PM -0300, Israel Barth Rubio wrote:

I've been playing a bit with the incremental backup feature which might
come as
part of the 17 release, and I think I hit a possible bug in the WAL
summarizer
process.

Thanks for testing new features and for this report!

FWIW, once I restart Postgres the WAL summarizer process gets back to normal
functioning. It seems to me there is some race condition between when a WAL
file
is removed and when `summarize_wal` is re-enabled, causing the process to
keep
looking for a WAL file that is the past.

I am adding an open item to track this issue, to make sure that this
is looked at.
--
Michael

#4Robert Haas
robertmhaas@gmail.com
In reply to: Israel Barth Rubio (#1)
1 attachment(s)
Re: Apparent bug in WAL summarizer process (hung state)

On Mon, Jun 24, 2024 at 1:56 PM Israel Barth Rubio
<barthisrael@gmail.com> wrote:

I've been playing a bit with the incremental backup feature which might come as
part of the 17 release, and I think I hit a possible bug in the WAL summarizer
process.

The issue that I face refers to the summarizer process getting into a hung state.
When the issue is triggered, it keeps in an infinite loop trying to process a WAL
file that no longer exists. It apparently comes up only when I perform changes to
`wal_summarize` GUC and reload Postgres, while there is some load in Postgres
which makes it recycle WAL files.

Yeah, this is a bug. It seems that the WAL summarizer process, when
restarted, wants to restart from wherever it was previously
summarizing WAL, which is correct if that WAL is still around, but if
summarize_wal has been turned off in the meanwhile, it might not be
correct. Here's a patch to fix that.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Prevent-summarizer-hang-when-summarize_wal-turn-o.patchapplication/octet-stream; name=v1-0001-Prevent-summarizer-hang-when-summarize_wal-turn-o.patchDownload
From e84b7458c7e0258ee75b07cf288825e07c016378 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 25 Jun 2024 15:42:36 -0400
Subject: [PATCH v1] Prevent summarizer hang when summarize_wal turn off and
 back on.

Before this commit, when the WAL summarizer started up or recovered
from an error, it would resume summarization from wherever it left
off. That was OK normally, but wrong if summarize_wal=off had been
turned off temporary, allowing some WAL to be removed, and then turned
back on again. In such cases, the WAL summarizer would simply hang
forever. This commit changes the reinitialization sequence for WAL
summarizer to rederive the starting position in the way we were
already doing at initial startup, fixing the problem.

Per report from Israel Barth Rubio.
---
 src/backend/access/transam/xlog.c      |  2 +-
 src/backend/postmaster/walsummarizer.c | 80 +++++++++++++-------------
 src/include/postmaster/walsummarizer.h |  3 +-
 3 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..a9784a4e10 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7906,7 +7906,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	 * If WAL summarization is in use, don't remove WAL that has yet to be
 	 * summarized.
 	 */
-	keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+	keep = GetOldestUnsummarizedLSN(NULL, NULL);
 	if (keep != InvalidXLogRecPtr)
 	{
 		XLogSegNo	unsummarized_segno;
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 72f6c04478..a387d97d3a 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -337,7 +337,7 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 	 *
 	 * If we discover that WAL summarization is not enabled, just exit.
 	 */
-	current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+	current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
 	if (XLogRecPtrIsInvalid(current_lsn))
 		proc_exit(0);
 
@@ -479,23 +479,18 @@ GetWalSummarizerState(TimeLineID *summarized_tli, XLogRecPtr *summarized_lsn,
 
 /*
  * Get the oldest LSN in this server's timeline history that has not yet been
- * summarized.
+ * summarized, and update shared memory state as appropriate.
  *
  * If *tli != NULL, it will be set to the TLI for the LSN that is returned.
  *
  * If *lsn_is_exact != NULL, it will be set to true if the returned LSN is
  * necessarily the start of a WAL record and false if it's just the beginning
  * of a WAL segment.
- *
- * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
- * be equal to the summarized_lsn.
  */
 XLogRecPtr
-GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
-						 bool reset_pending_lsn)
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
 {
 	TimeLineID	latest_tli;
-	LWLockMode	mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
 	int			n;
 	List	   *tles;
 	XLogRecPtr	unsummarized_lsn = InvalidXLogRecPtr;
@@ -503,22 +498,21 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 	bool		should_make_exact = false;
 	List	   *existing_summaries;
 	ListCell   *lc;
+	bool		am_wal_summarizer = AmWalSummarizerProcess();
 
 	/* If not summarizing WAL, do nothing. */
 	if (!summarize_wal)
 		return InvalidXLogRecPtr;
 
 	/*
-	 * Unless we need to reset the pending_lsn, we initially acquire the lock
-	 * in shared mode and try to fetch the required information. If we acquire
-	 * in shared mode and find that the data structure hasn't been
-	 * initialized, we reacquire the lock in exclusive mode so that we can
-	 * initialize it. However, if someone else does that first before we get
-	 * the lock, then we can just return the requested information after all.
+	 * If we are not the WAL summarizer process, then we normally just want
+	 * to read the values from shared memory. However, as an exception, if
+	 * shared memory hasn't been initialized yet, then we need to do that so
+	 * that we can read legal values and not remove any WAL too early.
 	 */
-	while (1)
+	if (!am_wal_summarizer)
 	{
-		LWLockAcquire(WALSummarizerLock, mode);
+		LWLockAcquire(WALSummarizerLock, LW_SHARED);
 
 		if (WalSummarizerCtl->initialized)
 		{
@@ -527,27 +521,22 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 				*tli = WalSummarizerCtl->summarized_tli;
 			if (lsn_is_exact != NULL)
 				*lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
-			if (reset_pending_lsn)
-				WalSummarizerCtl->pending_lsn =
-					WalSummarizerCtl->summarized_lsn;
 			LWLockRelease(WALSummarizerLock);
 			return unsummarized_lsn;
 		}
 
-		if (mode == LW_EXCLUSIVE)
-			break;
-
 		LWLockRelease(WALSummarizerLock);
-		mode = LW_EXCLUSIVE;
 	}
 
 	/*
-	 * The data structure needs to be initialized, and we are the first to
-	 * obtain the lock in exclusive mode, so it's our job to do that
-	 * initialization.
+	 * Find the oldest timeline on which WAL still exists, and the earliest
+	 * segment for which it exists.
 	 *
-	 * So, find the oldest timeline on which WAL still exists, and the
-	 * earliest segment for which it exists.
+	 * Note that we do this every time the WAL summarizer process restarts
+	 * or recovers from an error, in case the contents of pg_wal have changed
+	 * under us e.g. if some files were removed, either manually - which
+	 * shouldn't really happen, but might - or by postgres itself, if
+	 * summarize_wal was turned off and then back on again.
 	 */
 	(void) GetLatestLSN(&latest_tli);
 	tles = readTimeLineHistory(latest_tli);
@@ -568,12 +557,6 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 		}
 	}
 
-	/* It really should not be possible for us to find no WAL. */
-	if (unsummarized_tli == 0)
-		ereport(ERROR,
-				errcode(ERRCODE_INTERNAL_ERROR),
-				errmsg_internal("no WAL found on timeline %u", latest_tli));
-
 	/*
 	 * Don't try to summarize anything older than the end LSN of the newest
 	 * summary file that exists for this timeline.
@@ -592,12 +575,29 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 		}
 	}
 
-	/* Update shared memory with the discovered values. */
-	WalSummarizerCtl->initialized = true;
-	WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
-	WalSummarizerCtl->summarized_tli = unsummarized_tli;
-	WalSummarizerCtl->lsn_is_exact = should_make_exact;
-	WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+	/* It really should not be possible for us to find no WAL. */
+	if (unsummarized_tli == 0)
+		ereport(ERROR,
+				errcode(ERRCODE_INTERNAL_ERROR),
+				errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+	/*
+	 * If we're the WAL summarizer, we always want to store the values we
+	 * just computed into shared memory, because those are the values we're
+	 * going to use to drive our operation, and so they are the authoritative
+	 * values. Otherwise, we only store values into shared memory if they are
+	 */
+	LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+	if (am_wal_summarizer|| !WalSummarizerCtl->initialized)
+	{
+		WalSummarizerCtl->initialized = true;
+		WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+		WalSummarizerCtl->summarized_tli = unsummarized_tli;
+		WalSummarizerCtl->lsn_is_exact = should_make_exact;
+		WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+	}
+	else
+		unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
 
 	/* Also return the to the caller as required. */
 	if (tli != NULL)
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
index ad346d0c11..112bc1e6cb 100644
--- a/src/include/postmaster/walsummarizer.h
+++ b/src/include/postmaster/walsummarizer.h
@@ -28,8 +28,7 @@ extern void GetWalSummarizerState(TimeLineID *summarized_tli,
 								  XLogRecPtr *pending_lsn,
 								  int *summarizer_pid);
 extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
-										   bool *lsn_is_exact,
-										   bool reset_pending_lsn);
+										   bool *lsn_is_exact);
 extern void SetWalSummarizerLatch(void);
 extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
 										  XLogRecPtr *pending_lsn);
-- 
2.39.3 (Apple Git-145)

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: Apparent bug in WAL summarizer process (hung state)

Robert Haas <robertmhaas@gmail.com> writes:

Yeah, this is a bug. It seems that the WAL summarizer process, when
restarted, wants to restart from wherever it was previously
summarizing WAL, which is correct if that WAL is still around, but if
summarize_wal has been turned off in the meanwhile, it might not be
correct. Here's a patch to fix that.

This comment seems to be truncated:

+    /*
+     * If we're the WAL summarizer, we always want to store the values we
+     * just computed into shared memory, because those are the values we're
+     * going to use to drive our operation, and so they are the authoritative
+     * values. Otherwise, we only store values into shared memory if they are
+     */
+    LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+    if (am_wal_summarizer|| !WalSummarizerCtl->initialized)
+    {

regards, tom lane

#6Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: Apparent bug in WAL summarizer process (hung state)

On Tue, Jun 25, 2024 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This comment seems to be truncated:

Thanks. New version attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v2-0001-Prevent-summarizer-hang-when-summarize_wal-turned.patchapplication/octet-stream; name=v2-0001-Prevent-summarizer-hang-when-summarize_wal-turned.patchDownload
From 31a95324516e901d16e16111828112112dd302f4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 25 Jun 2024 15:42:36 -0400
Subject: [PATCH v2] Prevent summarizer hang when summarize_wal turned off and
 back on.

Before this commit, when the WAL summarizer started up or recovered
from an error, it would resume summarization from wherever it left
off. That was OK normally, but wrong if summarize_wal=off had been
turned off temporary, allowing some WAL to be removed, and then turned
back on again. In such cases, the WAL summarizer would simply hang
forever. This commit changes the reinitialization sequence for WAL
summarizer to rederive the starting position in the way we were
already doing at initial startup, fixing the problem.

Per report from Israel Barth Rubio. Reviewed by Tom Lane.
---
 src/backend/access/transam/xlog.c      |  2 +-
 src/backend/postmaster/walsummarizer.c | 83 +++++++++++++-------------
 src/include/postmaster/walsummarizer.h |  3 +-
 3 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..a9784a4e10 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7906,7 +7906,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	 * If WAL summarization is in use, don't remove WAL that has yet to be
 	 * summarized.
 	 */
-	keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+	keep = GetOldestUnsummarizedLSN(NULL, NULL);
 	if (keep != InvalidXLogRecPtr)
 	{
 		XLogSegNo	unsummarized_segno;
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 72f6c04478..7f987af40d 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -337,7 +337,7 @@ WalSummarizerMain(char *startup_data, size_t startup_data_len)
 	 *
 	 * If we discover that WAL summarization is not enabled, just exit.
 	 */
-	current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+	current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
 	if (XLogRecPtrIsInvalid(current_lsn))
 		proc_exit(0);
 
@@ -479,23 +479,18 @@ GetWalSummarizerState(TimeLineID *summarized_tli, XLogRecPtr *summarized_lsn,
 
 /*
  * Get the oldest LSN in this server's timeline history that has not yet been
- * summarized.
+ * summarized, and update shared memory state as appropriate.
  *
  * If *tli != NULL, it will be set to the TLI for the LSN that is returned.
  *
  * If *lsn_is_exact != NULL, it will be set to true if the returned LSN is
  * necessarily the start of a WAL record and false if it's just the beginning
  * of a WAL segment.
- *
- * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
- * be equal to the summarized_lsn.
  */
 XLogRecPtr
-GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
-						 bool reset_pending_lsn)
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
 {
 	TimeLineID	latest_tli;
-	LWLockMode	mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
 	int			n;
 	List	   *tles;
 	XLogRecPtr	unsummarized_lsn = InvalidXLogRecPtr;
@@ -503,22 +498,21 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 	bool		should_make_exact = false;
 	List	   *existing_summaries;
 	ListCell   *lc;
+	bool		am_wal_summarizer = AmWalSummarizerProcess();
 
 	/* If not summarizing WAL, do nothing. */
 	if (!summarize_wal)
 		return InvalidXLogRecPtr;
 
 	/*
-	 * Unless we need to reset the pending_lsn, we initially acquire the lock
-	 * in shared mode and try to fetch the required information. If we acquire
-	 * in shared mode and find that the data structure hasn't been
-	 * initialized, we reacquire the lock in exclusive mode so that we can
-	 * initialize it. However, if someone else does that first before we get
-	 * the lock, then we can just return the requested information after all.
+	 * If we are not the WAL summarizer process, then we normally just want
+	 * to read the values from shared memory. However, as an exception, if
+	 * shared memory hasn't been initialized yet, then we need to do that so
+	 * that we can read legal values and not remove any WAL too early.
 	 */
-	while (1)
+	if (!am_wal_summarizer)
 	{
-		LWLockAcquire(WALSummarizerLock, mode);
+		LWLockAcquire(WALSummarizerLock, LW_SHARED);
 
 		if (WalSummarizerCtl->initialized)
 		{
@@ -527,27 +521,22 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 				*tli = WalSummarizerCtl->summarized_tli;
 			if (lsn_is_exact != NULL)
 				*lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
-			if (reset_pending_lsn)
-				WalSummarizerCtl->pending_lsn =
-					WalSummarizerCtl->summarized_lsn;
 			LWLockRelease(WALSummarizerLock);
 			return unsummarized_lsn;
 		}
 
-		if (mode == LW_EXCLUSIVE)
-			break;
-
 		LWLockRelease(WALSummarizerLock);
-		mode = LW_EXCLUSIVE;
 	}
 
 	/*
-	 * The data structure needs to be initialized, and we are the first to
-	 * obtain the lock in exclusive mode, so it's our job to do that
-	 * initialization.
+	 * Find the oldest timeline on which WAL still exists, and the earliest
+	 * segment for which it exists.
 	 *
-	 * So, find the oldest timeline on which WAL still exists, and the
-	 * earliest segment for which it exists.
+	 * Note that we do this every time the WAL summarizer process restarts
+	 * or recovers from an error, in case the contents of pg_wal have changed
+	 * under us e.g. if some files were removed, either manually - which
+	 * shouldn't really happen, but might - or by postgres itself, if
+	 * summarize_wal was turned off and then back on again.
 	 */
 	(void) GetLatestLSN(&latest_tli);
 	tles = readTimeLineHistory(latest_tli);
@@ -568,12 +557,6 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 		}
 	}
 
-	/* It really should not be possible for us to find no WAL. */
-	if (unsummarized_tli == 0)
-		ereport(ERROR,
-				errcode(ERRCODE_INTERNAL_ERROR),
-				errmsg_internal("no WAL found on timeline %u", latest_tli));
-
 	/*
 	 * Don't try to summarize anything older than the end LSN of the newest
 	 * summary file that exists for this timeline.
@@ -592,12 +575,32 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 		}
 	}
 
-	/* Update shared memory with the discovered values. */
-	WalSummarizerCtl->initialized = true;
-	WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
-	WalSummarizerCtl->summarized_tli = unsummarized_tli;
-	WalSummarizerCtl->lsn_is_exact = should_make_exact;
-	WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+	/* It really should not be possible for us to find no WAL. */
+	if (unsummarized_tli == 0)
+		ereport(ERROR,
+				errcode(ERRCODE_INTERNAL_ERROR),
+				errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+	/*
+	 * If we're the WAL summarizer, we always want to store the values we
+	 * just computed into shared memory, because those are the values we're
+	 * going to use to drive our operation, and so they are the authoritative
+	 * values. Otherwise, we only store values into shared memory if shared
+	 * memory is uninitialized. Our values are not canonical in such a case,
+	 * but it's better to have something than nothing, to guide WAL
+	 * retention.
+	 */
+	LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+	if (am_wal_summarizer|| !WalSummarizerCtl->initialized)
+	{
+		WalSummarizerCtl->initialized = true;
+		WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+		WalSummarizerCtl->summarized_tli = unsummarized_tli;
+		WalSummarizerCtl->lsn_is_exact = should_make_exact;
+		WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+	}
+	else
+		unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
 
 	/* Also return the to the caller as required. */
 	if (tli != NULL)
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
index ad346d0c11..112bc1e6cb 100644
--- a/src/include/postmaster/walsummarizer.h
+++ b/src/include/postmaster/walsummarizer.h
@@ -28,8 +28,7 @@ extern void GetWalSummarizerState(TimeLineID *summarized_tli,
 								  XLogRecPtr *pending_lsn,
 								  int *summarizer_pid);
 extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
-										   bool *lsn_is_exact,
-										   bool reset_pending_lsn);
+										   bool *lsn_is_exact);
 extern void SetWalSummarizerLatch(void);
 extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
 										  XLogRecPtr *pending_lsn);
-- 
2.39.3 (Apple Git-145)

#7Israel Barth Rubio
barthisrael@gmail.com
In reply to: Robert Haas (#6)
Re: Apparent bug in WAL summarizer process (hung state)

Yeah, this is a bug. It seems that the WAL summarizer process, when
restarted, wants to restart from wherever it was previously
summarizing WAL, which is correct if that WAL is still around, but if
summarize_wal has been turned off in the meanwhile, it might not be
correct. Here's a patch to fix that.

Thanks for checking this!

Thanks. New version attached.

And besides that, thanks for the patch, of course!

I compiled Postgres locally with your patch. I attempted to break it several
times, both manually and through a shell script.

No success on that -- which in this case is actually success :)
The WAL summarizer seems able to always resume from a valid point,
so `pg_basebackup` isn't failing anymore.