Race between KeepFileRestoredFromArchive() and restartpoint

Started by Noah Mischalmost 5 years ago15 messages

noah@leadboat.com

almost 5 years ago

1 attachment(s)

A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as
seen once on the buildfarm[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17. The attached patch adds a test case; it
applies atop the "stop events" patch[2]/messages/by-id/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com. We have two systems for adding
long-term pg_wal directory entries. KeepFileRestoredFromArchive() adds them
during archive recovery, while InstallXLogFileSegment() does so at all times.
Unfortunately, InstallXLogFileSegment() happens throughout archive recovery,
via the checkpointer recycling segments and calling PreallocXlogFiles().
Multiple processes can run InstallXLogFileSegment(), which uses
ControlFileLock to represent the authority to modify the directory listing of
pg_wal. KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
InstallXLogFileSegment(). Add ControlFileLock acquisition to
KeepFileRestoredFromArchive(). This provides adequate mutual exclusion, but
XLogFileInit() could return a file descriptor for an unlinked file. That's
fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments. (Just delete
obsolete WAL.) By denying those benefits, this presumably makes streaming
recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
then copy bytes. This is simple, but it multiplies I/O. That might be
negligible on account of caching, or it might not be. A variant, incurring
extra fsyncs, would be to use durable_rename() to replace the segment we get
from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking. This
avoids checkpoint failure, but a race could trigger noise from the LOG
message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative? It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2]: /messages/by-id/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com

Attachments:

repro-KeepFileRestoredFromArchive-v0.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Not for commit; for demonstration only.
    
    Before applying this, apply
    https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 199d911..1c0a988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -71,6 +71,7 @@
 #include "storage/reinit.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
+#include "storage/stopevent.h"
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -3583,6 +3584,23 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
+static Jsonb *
+InstallXLogFileSegmentEndStopEventParams(XLogSegNo segno)
+{
+	MemoryContext oldCxt;
+	JsonbParseState *state = NULL;
+	Jsonb	   *res;
+
+	stopevents_make_cxt();
+	oldCxt = MemoryContextSwitchTo(stopevents_cxt);
+	pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
+	jsonb_push_int8_key(&state, "segno", segno);
+	res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL));
+	MemoryContextSwitchTo(oldCxt);
+
+	return res;
+}
+
 /*
  * Install a new XLOG segment file as a current or future log segment.
  *
@@ -3664,6 +3682,9 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
 
+	STOPEVENT(InstallXLogFileSegmentEndStopEvent,
+			  InstallXLogFileSegmentEndStopEventParams(*segno));
+
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8..d9c5a3a 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -30,6 +30,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pmsignal.h"
+#include "storage/stopevent.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -372,6 +373,23 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, bool failOn
 }
 
 
+static Jsonb *
+KeepFileRestoredFromArchiveStopEventParams(const char *xlogfname)
+{
+	MemoryContext oldCxt;
+	JsonbParseState *state = NULL;
+	Jsonb	   *res;
+
+	stopevents_make_cxt();
+	oldCxt = MemoryContextSwitchTo(stopevents_cxt);
+	pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
+	jsonb_push_string_key(&state, "xlogfname", xlogfname);
+	res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL));
+	MemoryContextSwitchTo(oldCxt);
+
+	return res;
+}
+
 /*
  * A file was restored from the archive under a temporary filename (path),
  * and now we want to keep it. Rename it under the permanent filename in
@@ -386,6 +404,9 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
 
 	snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 
+	STOPEVENT(KeepFileRestoredFromArchiveStopEvent,
+			  KeepFileRestoredFromArchiveStopEventParams(xlogfname));
+
 	if (stat(xlogfpath, &statbuf) == 0)
 	{
 		char		oldpath[MAXPGPATH];
@@ -424,6 +445,9 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
 		reload = true;
 	}
 
+	STOPEVENT(KeepFileRestoredFromArchivePreRenameStopEvent,
+			  KeepFileRestoredFromArchiveStopEventParams(xlogfname));
+
 	durable_rename(path, xlogfpath, ERROR);
 
 	/*
diff --git a/src/backend/storage/lmgr/stopevent.c b/src/backend/storage/lmgr/stopevent.c
index 6b6eb2b..251bdfe 100644
--- a/src/backend/storage/lmgr/stopevent.c
+++ b/src/backend/storage/lmgr/stopevent.c
@@ -280,12 +280,14 @@ handle_stopevent(int event_id, Jsonb *params)
 	if (trace_stopevents)
 	{
 		char	   *params_string;
+		MemoryContext oldCxt;
 
+		oldCxt = MemoryContextSwitchTo(stopevents_cxt);
 		params_string = DatumGetCString(DirectFunctionCall1(jsonb_out, PointerGetDatum(params)));
 		elog(DEBUG2, "stop event \"%s\", params \"%s\"",
 			 stopeventnames[event_id],
 			 params_string);
-		pfree(params_string);
+		MemoryContextSwitchTo(oldCxt);
 	}
 
 	MemoryContextReset(stopevents_cxt);
@@ -342,7 +344,12 @@ void
 relation_stopevent_params(JsonbParseState **state, Relation relation)
 {
 	jsonb_push_int8_key(state, "datoid", MyDatabaseId);
+#if 0
+	/* infinite recursion to segfault */
 	jsonb_push_string_key(state, "datname", get_database_name(MyDatabaseId));
+#else
+	jsonb_push_string_key(state, "datname", "fake");
+#endif
 	jsonb_push_int8_key(state, "reloid", relation->rd_id);
 	jsonb_push_string_key(state, "relname", NameStr(relation->rd_rel->relname));
 }
diff --git a/src/backend/storage/lmgr/stopeventnames.txt b/src/backend/storage/lmgr/stopeventnames.txt
index b5330a1..aadff7a 100644
--- a/src/backend/storage/lmgr/stopeventnames.txt
+++ b/src/backend/storage/lmgr/stopeventnames.txt
@@ -1,2 +1,5 @@
+InstallXLogFileSegmentEndStopEvent
+KeepFileRestoredFromArchiveStopEvent
+KeepFileRestoredFromArchivePreRenameStopEvent
 ReleaseAndReadBufferStopEvent
 EntryFindPostingLeafPageStopEvent
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 84472b9..4c217d5 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -23,6 +23,7 @@
 
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "storage/stopevent.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
@@ -126,6 +127,10 @@ MemoryContextInit(void)
 										 8 * 1024,
 										 8 * 1024);
 	MemoryContextAllowInCriticalSection(ErrorContext, true);
+
+	stopevents_make_cxt();
+	/* not really okay; for a demo patch, I'll take my chances */
+	MemoryContextAllowInCriticalSection(stopevents_cxt, true);
 }
 
 /*
diff --git a/src/include/storage/stopevent.h b/src/include/storage/stopevent.h
index a8b62f2..53dbf98 100644
--- a/src/include/storage/stopevent.h
+++ b/src/include/storage/stopevent.h
@@ -2,6 +2,7 @@
 #define SRC_STOPEVENT_H
 
 #include "utils/jsonb.h"
+#include "utils/rel.h"
 #include "storage/stopeventnames.h"
 
 extern bool enable_stopevents;
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f76..caa6de5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1737,7 +1737,7 @@ sub poll_query_until
 
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = 3 * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
diff --git a/src/test/recovery/t/022_install_xlog.pl b/src/test/recovery/t/022_install_xlog.pl
new file mode 100644
index 0000000..16d8a22
--- /dev/null
+++ b/src/test/recovery/t/022_install_xlog.pl
@@ -0,0 +1,96 @@
+#
+# Test InstallXLogFileSegment()/KeepFileRestoredFromArchive() mutual exclusion
+#
+# Specifically, recreate this sequence of events:
+#
+# checkpointer enters XLogFileInit, finds no segno:2 file
+# checkpointer finishes InstallXLogFileSegment, creating segno:2 file
+# startupProc enters KeepFileRestoredFromArchive, unlinks segno:2
+# checkpointer enters the last BasicOpenFile of XLogFileInit, which fails
+# startupProc finishes KeepFileRestoredFromArchive (too late)
+#
+use strict;
+use warnings;
+use PostgresNode;
+use Test::More tests => 1;
+
+my $primary = get_new_node('primary');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+$primary->stop('immediate');
+$primary->backup_fs_cold('backup');
+$primary->start;
+
+# Standby will get these two segments from the archive.
+$primary->safe_psql(
+	'postgres', q{
+	-- Standby makes a restart point from this checkpoint.
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_wal();
+	-- Standby restores this segment in parallel with its restart point,
+	-- exercising the race condition.
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+});
+
+
+my $standby = get_new_node('standby');
+$standby->init_from_backup($primary, 'backup', has_restoring => 1);
+# exercise PreallocXlogFiles(), not recycling
+$standby->append_conf('postgresql.conf', 'wal_keep_size = 160MB');
+# break restore_command so we can first inject stop events
+$standby->append_conf('postgresql.conf', 'restore_command = false');
+$standby->start;
+
+# register all stop events
+$standby->safe_psql(
+	'postgres', q{
+ select pg_stopevent_set('InstallXLogFileSegmentEnd', '$.segno == 2');
+ select pg_stopevent_set('KeepFileRestoredFromArchive',
+  '$.xlogfname == "000000010000000000000002"');
+ select pg_stopevent_set('KeepFileRestoredFromArchivePreRename',
+  '$.xlogfname == "000000010000000000000002"');
+});
+# repair restore_command
+$standby->enable_restoring($primary, 1);
+$standby->safe_psql('postgres', q{SELECT pg_reload_conf();});
+
+
+# poll until some process is blocked as a pg_stopevent_set() directed
+sub poll_until_event
+{
+	my $event = shift;
+	$standby->poll_query_until(
+		'postgres',
+		qq{ SELECT true FROM pg_stopevents() WHERE cardinality(waiters) <> 0
+			AND event_name = '$event'; }
+	) or die "Timed out while waiting for $event on standby";
+}
+
+poll_until_event 'KeepFileRestoredFromArchive';    # segno==2 restored
+
+# Begin a restartpoint, pausing at CreateRestartPoint -> PreallocXlogFiles ->
+# XLogFileInit -> InstallXLogFileSegment -> InstallXLogFileSegmentEnd.
+my $in    = '';
+my $out   = '';
+my $timer = IPC::Run::timeout(180);
+my $h     = $standby->background_psql('postgres', \$in, \$out, $timer);
+$in .= "CHECKPOINT;\n";
+$h->pump_nb;
+poll_until_event 'InstallXLogFileSegmentEnd';
+
+# release startupProc to unlink segno==2
+$standby->safe_psql('postgres',
+	q{SELECT pg_stopevent_reset('KeepFileRestoredFromArchive');});
+poll_until_event 'KeepFileRestoredFromArchivePreRename';
+# release checkpointer to open segno==2
+$standby->safe_psql('postgres',
+	q{SELECT pg_stopevent_reset('InstallXLogFileSegmentEnd');});
+ok($h->finish, 'CHECKPOINT completed');
+# release startupProc to rename file onto segno==2, but it's too late
+$standby->safe_psql('postgres',
+	q{SELECT pg_stopevent_reset('KeepFileRestoredFromArchivePreRename');});

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Noah Misch (#1)

5 attachment(s)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

Attachments:

XLogFileInit1-use_lock-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Remove XLogFileInit() ability to skip ControlFileLock.
    
    Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.
    
    Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b3a3d9..39a38ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -913,8 +913,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-								   bool find_free, XLogSegNo max_segno,
-								   bool use_lock);
+								   bool find_free, XLogSegNo max_segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -2492,7 +2491,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
 			ReserveExternalFD();
 		}
 
@@ -3265,10 +3264,6 @@ XLogNeedsFlush(XLogRecPtr record)
  * pre-existing file will be deleted).  On return, true if a pre-existing
  * file was used.
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -3277,7 +3272,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3437,8 +3432,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno,
-								use_lock))
+								*use_existent, max_segno))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3592,7 +3586,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3616,29 +3610,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * free slot is found between *segno and max_segno. (Ignored when find_free
  * is false.)
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns true if the file was installed successfully.  false indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool find_free, XLogSegNo max_segno)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
-	/*
-	 * We want to be sure that only one process does this at a time.
-	 */
-	if (use_lock)
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	if (!find_free)
 	{
@@ -3653,8 +3638,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 			if ((*segno) >= max_segno)
 			{
 				/* Failed to find a free slot within specified range */
-				if (use_lock)
-					LWLockRelease(ControlFileLock);
+				LWLockRelease(ControlFileLock);
 				return false;
 			}
 			(*segno)++;
@@ -3668,14 +3652,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	 */
 	if (durable_rename_excl(tmppath, path, LOG) != 0)
 	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
+		LWLockRelease(ControlFileLock);
 		/* durable_rename_excl already emitted log message */
 		return false;
 	}
 
-	if (use_lock)
-		LWLockRelease(ControlFileLock);
+	LWLockRelease(ControlFileLock);
 
 	return true;
 }
@@ -3946,7 +3928,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
+		lf = XLogFileInit(_logSegNo, &use_existent);
 		close(lf);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
@@ -4223,7 +4205,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		*endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo))
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("recycled write-ahead log file \"%s\"",
@@ -5341,7 +5323,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5650,7 +5632,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		bool		use_existent = true;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f..b00066e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -924,7 +924,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent);
 			recvFileTLI = ThisTimeLineID;
 		}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c1..a175649 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
+extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);

XLogFileInit2-redefine-use_existent-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    In XLogFileInit(), fix *use_existent postcondition to suit callers.
    
    Infrequently, the mismatch caused log_checkpoints messages and
    TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
    by one.  Since that consequence is so minor, no back-patch.
    
    Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 39a38ba..073dabc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3261,8 +3261,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * logsegno: identify segment to be created/opened.
  *
  * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, true if a pre-existing
- * file was used.
+ * pre-existing file will be deleted).  On return, false iff this call added
+ * some segment on disk.
  *
  * Returns FD of opened file.
  *
@@ -3431,8 +3431,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno))
+	if (InstallXLogFileSegment(&installed_segno, tmppath,
+							   *use_existent, max_segno))
+		*use_existent = false;
+	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3442,9 +3444,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 		unlink(tmppath);
 	}
 
-	/* Set flag to tell caller there was no existent file */
-	*use_existent = false;
-
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)

XLogFileInit3-write-only-use_existent-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Remove XLogFileInit() ability to unlink a pre-existing file.
    
    Only initdb used it.  initdb refuses to operate on a non-empty directory
    and generally does not cope with pre-existing files of other kinds.
    Hence, use the opportunity to simplify.
    
    Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 073dabc..1a31788 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2424,7 +2424,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
+	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2490,8 +2490,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
+			openLogFile = XLogFileInit(openLogSegNo, &added);
 			ReserveExternalFD();
 		}
 
@@ -3260,9 +3259,7 @@ XLogNeedsFlush(XLogRecPtr record)
  *
  * logsegno: identify segment to be created/opened.
  *
- * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, false iff this call added
- * some segment on disk.
+ * *added: on return, true if this call raised the number of extant segments.
  *
  * Returns FD of opened file.
  *
@@ -3272,7 +3269,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent)
+XLogFileInit(XLogSegNo logsegno, bool *added)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3287,19 +3284,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
-	if (*use_existent)
+	*added = false;
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	if (fd < 0)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
-		{
-			if (errno != ENOENT)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open file \"%s\": %m", path)));
-		}
-		else
-			return fd;
+		if (errno != ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
 	}
+	else
+		return fd;
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that
@@ -3412,12 +3407,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the segment into place with its final name.
-	 *
-	 * If caller didn't want to use a pre-existing file, get rid of any
-	 * pre-existing file.  Otherwise, cope with possibility that someone else
-	 * has created the file while we were filling ours: if so, use ours to
-	 * pre-create a future log segment.
+	 * Now move the segment into place with its final name.  Cope with
+	 * possibility that someone else has created the file while we were
+	 * filling ours: if so, use ours to pre-create a future log segment.
 	 */
 	installed_segno = logsegno;
 
@@ -3431,9 +3423,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (InstallXLogFileSegment(&installed_segno, tmppath,
-							   *use_existent, max_segno))
-		*use_existent = false;
+	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+		*added = true;
 	else
 	{
 		/*
@@ -3918,7 +3909,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
 	int			lf;
-	bool		use_existent;
+	bool		added;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3926,10 +3917,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent);
+		lf = XLogFileInit(_logSegNo, &added);
 		close(lf);
-		if (!use_existent)
+		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
 }
@@ -5224,7 +5214,7 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		use_existent;
+	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5321,8 +5311,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent);
+	openLogFile = XLogFileInit(1, &added);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5628,10 +5617,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		use_existent = true;
+		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent);
+		fd = XLogFileInit(startLogSegNo, &added);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b00066e..eadff8f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -885,7 +885,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
-			bool		use_existent;
+			bool		added;
 
 			/*
 			 * fsync() and close current file before we switch to next one. We
@@ -923,8 +923,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent);
+			recvFile = XLogFileInit(recvSegNo, &added);
 			recvFileTLI = ThisTimeLineID;
 		}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a175649..d8b8f59 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
+extern int	XLogFileInit(XLogSegNo segno, bool *added);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);

XLogFileInit4-PreallocXlogFiles-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Don't ERROR on PreallocXlogFiles() race condition.
    
    Before a restartpoint finishes PreallocXlogFiles(), a startup process
    KeepFileRestoredFromArchive() call can unlink the preallocated segment.
    If a CHECKPOINT sql command had elicited the restartpoint experiencing
    the race condition, that sql command failed.  Moreover, the restartpoint
    omitted its log_checkpoints message and some inessential resource
    reclamation.  Prevent the ERROR by skipping open() of the segment.
    Since these consequences are so minor, no back-patch.
    
    Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a31788..8cda20d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2424,7 +2424,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2490,7 +2489,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			openLogFile = XLogFileInit(openLogSegNo, &added);
+			openLogFile = XLogFileInit(openLogSegNo);
 			ReserveExternalFD();
 		}
 
@@ -3255,23 +3254,21 @@ XLogNeedsFlush(XLogRecPtr record)
 }
 
 /*
- * Create a new XLOG file segment, or open a pre-existing one.
+ * Try to make a given XLOG file segment exist.
  *
- * logsegno: identify segment to be created/opened.
+ * logsegno: identify segment.
  *
  * *added: on return, true if this call raised the number of extant segments.
  *
- * Returns FD of opened file.
+ * path: on return, this char[MAXPGPATH] has the path to the logsegno file.
  *
- * Note: errors here are ERROR not PANIC because we might or might not be
- * inside a critical section (eg, during checkpoint there is no reason to
- * take down the system on failure).  They will promote to PANIC if we are
- * in a critical section.
+ * Returns -1 or FD of opened file.  A -1 here is not an error; a caller
+ * wanting an open segment should attempt to open "path", which usually will
+ * succeed.  (This is weird, but it's efficient for the callers.)
  */
-int
-XLogFileInit(XLogSegNo logsegno, bool *added)
+static int
+XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path)
 {
-	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
@@ -3424,26 +3421,53 @@ XLogFileInit(XLogSegNo logsegno, bool *added)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+	{
 		*added = true;
+		elog(DEBUG2, "done creating and filling new WAL file");
+	}
 	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
-		 * failed to rename the file into place. If the rename failed, opening
-		 * the file below will fail.
+		 * failed to rename the file into place. If the rename failed, a
+		 * caller opening the file may fail.
 		 */
 		unlink(tmppath);
+		elog(DEBUG2, "abandoned new WAL file");
 	}
 
+	return -1;
+}
+
+/*
+ * Create a new XLOG file segment, or open a pre-existing one.
+ *
+ * logsegno: identify segment to be created/opened.
+ *
+ * Returns FD of opened file.
+ *
+ * Note: errors here are ERROR not PANIC because we might or might not be
+ * inside a critical section (eg, during checkpoint there is no reason to
+ * take down the system on failure).  They will promote to PANIC if we are
+ * in a critical section.
+ */
+int
+XLogFileInit(XLogSegNo logsegno)
+{
+	bool		ignore_added;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	fd = XLogFileInitInternal(logsegno, &ignore_added, path);
+	if (fd >= 0)
+		return fd;
+
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
-
-	elog(DEBUG2, "done creating and filling new WAL file");
-
 	return fd;
 }
 
@@ -3903,6 +3927,15 @@ XLogFileClose(void)
  * High-volume systems will be OK once they've built up a sufficient set of
  * recycled log segments, but the startup transient is likely to include
  * a lot of segment creations by foreground processes, which is not so good.
+ *
+ * XLogFileInitInternal() can ereport(ERROR).  All known causes indicate big
+ * trouble; for example, a full filesystem is one cause.  The checkpoint WAL
+ * and/or ControlFile updates already completed.  If a RequestCheckpoint()
+ * initiated the present checkpoint and an ERROR ends this function, the
+ * command that called RequestCheckpoint() fails.  That's not ideal, but it's
+ * not worth contorting more functions to use caller-specified elevel values.
+ * (With or without RequestCheckpoint(), an ERROR forestalls some inessential
+ * reporting and resource reclamation.)
  */
 static void
 PreallocXlogFiles(XLogRecPtr endptr)
@@ -3910,6 +3943,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		added;
+	char		path[MAXPGPATH];
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3917,8 +3951,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		lf = XLogFileInit(_logSegNo, &added);
-		close(lf);
+		lf = XLogFileInitInternal(_logSegNo, &added, path);
+		if (lf >= 0)
+			close(lf);
 		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -5214,7 +5249,6 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5311,7 +5345,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	openLogFile = XLogFileInit(1, &added);
+	openLogFile = XLogFileInit(1);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5617,10 +5651,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &added);
+		fd = XLogFileInit(startLogSegNo);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index eadff8f..2be9ad9 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -885,8 +885,6 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
-			bool		added;
-
 			/*
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
@@ -923,7 +921,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &added);
+			recvFile = XLogFileInit(recvSegNo);
 			recvFileTLI = ThisTimeLineID;
 		}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d8b8f59..7510e88 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *added);
+extern int	XLogFileInit(XLogSegNo segno);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);

XLogFileInit5-lock-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Skip WAL recycling and preallocation during archive recovery.
    
    The previous commit addressed the chief consequences of a race condition
    between InstallXLogFileSegment() and KeepFileRestoredFromArchive().  Fix
    three lesser consequences.  A spurious durable_rename_excl() LOG message
    remained possible.  KeepFileRestoredFromArchive() wasted the proceeds of
    WAL recycling and preallocation.  Finally, XLogFileInitInternal() could
    return a descriptor for a file that KeepFileRestoredFromArchive() had
    already unlinked.  That felt like a recipe for future bugs.
    
    Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8cda20d..2c6e21b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -663,6 +663,16 @@ typedef struct XLogCtlData
 	bool		SharedHotStandbyActive;
 
 	/*
+	 * InstallXLogFileSegmentActive indicates whether the checkpointer should
+	 * arrange for future segments by recycling and/or PreallocXlogFiles().
+	 * Protected by ControlFileLock.  Only the startup process changes it.  If
+	 * true, anyone can use InstallXLogFileSegment().  If false, the startup
+	 * process owns the exclusive right to install segments, by reading from
+	 * the archive and possibly replacing existing files.
+	 */
+	bool		InstallXLogFileSegmentActive;
+
+	/*
 	 * SharedPromoteIsTriggered indicates if a standby promotion has been
 	 * triggered.  Protected by info_lck.
 	 */
@@ -921,6 +931,7 @@ static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -3625,8 +3636,8 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * is false.)
  *
  * Returns true if the file was installed successfully.  false indicates that
- * max_segno limit was exceeded, or an error occurred while renaming the
- * file into place.
+ * max_segno limit was exceeded, the startup process has disabled this
+ * function for now, or an error occurred while renaming the file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
@@ -3638,6 +3649,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+	{
+		LWLockRelease(ControlFileLock);
+		return false;
+	}
 
 	if (!find_free)
 	{
@@ -3745,6 +3761,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 	 */
 	if (source == XLOG_FROM_ARCHIVE)
 	{
+		Assert(!XLogCtl->InstallXLogFileSegmentActive);
 		KeepFileRestoredFromArchive(path, xlogfname);
 
 		/*
@@ -3946,6 +3963,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	char		path[MAXPGPATH];
 	uint64		offset;
 
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+		return;					/* unlocked check says no */
+
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
 	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
 	if (offset >= (uint32) (0.75 * wal_segment_size))
@@ -4227,6 +4247,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 	 */
 	if (wal_recycle &&
 		*endlogSegNo <= recycleSegNo &&
+		XLogCtl->InstallXLogFileSegmentActive &&	/* callee rechecks this */
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
 							   true, recycleSegNo))
@@ -4240,7 +4261,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 	}
 	else
 	{
-		/* No need for any more future segments... */
+		/* No need for any more future segments, or recycling failed ... */
 		int			rc;
 
 		ereport(DEBUG2,
@@ -5226,6 +5247,7 @@ XLOGShmemInit(void)
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
+	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -5253,6 +5275,11 @@ BootStrapXLOG(void)
 	struct timeval tv;
 	pg_crc32c	crc;
 
+	/* allow ordinary WAL segment creation, like StartupXLOG() would */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
 	/*
 	 * Select a hopefully-unique system identifier code for this installation.
 	 * We use the result of gettimeofday(), including the fractional seconds
@@ -7619,7 +7646,7 @@ StartupXLOG(void)
 	 * the startup checkpoint record. It will trump over the checkpoint and
 	 * subsequent records if it's still alive when we start writing WAL.
 	 */
-	ShutdownWalRcv();
+	XLogShutdownWalRcv();
 
 	/*
 	 * Reset unlogged relations to the contents of their INIT fork. This is
@@ -7644,7 +7671,7 @@ StartupXLOG(void)
 	 * recovery, e.g., timeline history file) from archive or pg_wal.
 	 *
 	 * Note that standby mode must be turned off after killing WAL receiver,
-	 * i.e., calling ShutdownWalRcv().
+	 * i.e., calling XLogShutdownWalRcv().
 	 */
 	Assert(!WalRcvStreaming());
 	StandbyMode = false;
@@ -7710,6 +7737,14 @@ StartupXLOG(void)
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
 	/*
+	 * Allow ordinary WAL segment creation before any exitArchiveRecovery(),
+	 * which sometimes creates a segment, and after the last ReadRecord().
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -12378,7 +12413,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 						return false;
 					}
 
@@ -12426,7 +12461,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * WAL that we restore from archive.
 					 */
 					if (WalRcvStreaming())
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
@@ -12553,7 +12588,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (pendingWalRcvRestart && !startWalReceiver)
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 						/*
 						 * Re-scan for possible new timelines if we were
@@ -12603,6 +12638,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
+						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+						XLogCtl->InstallXLogFileSegmentActive = true;
+						LWLockRelease(ControlFileLock);
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
@@ -12770,6 +12808,17 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
+/* Thin wrapper around ShutdownWalRcv(). */
+static void
+XLogShutdownWalRcv(void)
+{
+	ShutdownWalRcv();
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = false;
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().

David Steele

david@pgmasters.net

over 3 years ago

In reply to: Noah Misch (#2)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1]/messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com.

I wonder if that would make it worth a back patch?

[1]: /messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com
/messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com

Noah Misch

noah@leadboat.com

over 3 years ago

In reply to: David Steele (#3)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

Show quoted text

[1] /messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com

David Steele

david@pgmasters.net

over 3 years ago

In reply to: Noah Misch (#4)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people
from wasting time tracking down a problem that has already been fixed.

However, there is another issue [1]/messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com that might argue for a back patch if
this patch (as I believe) would fix the issue.

Regards,
-David

[1]: /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com
/messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

Noah Misch

noah@leadboat.com

over 3 years ago

In reply to: David Steele (#5)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

David Steele

david@pgmasters.net

over 3 years ago

In reply to: Noah Misch (#6)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 8/2/22 10:37, Noah Misch wrote:

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

Regards,
-David

Don Seiler

don@seiler.us

over 3 years ago

In reply to: David Steele (#7)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:

That makes sense. Each iteration of the restartpoint recycle loop has a

1/N

chance of failing. Recovery adds >N files between restartpoints.

Hence, the

WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

During my latest incident, there was only one occurrence:

could not link file “pg_wal/xlogtemp.18799" to

“pg_wal/000000010000D45300000010”: File exists

WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

--
Don Seiler
www.seiler.us

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Don Seiler (#8)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:

That makes sense. Each iteration of the restartpoint recycle loop has a

1/N

chance of failing. Recovery adds >N files between restartpoints.

Hence, the

WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the
issue in the other thread due to skipping many checkpoint records
within the checkpoint_timeout? I remember that I proposed a GUC
variable to disable that checkpoint skipping. As another measure for
that issue, we could force replaying checkpoint if max_wal_size is
already filled up or known to be filled in the next checkpoint cycle.
If this is correct, this patch is irrelevant to the issue.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

During my latest incident, there was only one occurrence:

could not link file “pg_wal/xlogtemp.18799" to

“pg_wal/000000010000D45300000010”: File exists

(I noticed that the patch in the other thread is broken:()

Hmm. It seems like a race condition betwen StartupXLOG() and
RemoveXlogFIle(). We need wider extent of ContolFileLock. Concretely
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this happening. (If this is correct
this is a live issue on the master branch.)

WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10

Noah Misch

noah@leadboat.com

over 3 years ago

In reply to: Kyotaro Horiguchi (#9)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

could not link file “pg_wal/xlogtemp.18799" to

“pg_wal/000000010000D45300000010”: File exists

Hmm. It seems like a race condition betwen StartupXLOG() and
RemoveXlogFIle(). We need wider extent of ContolFileLock. Concretely
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this happening. (If this is correct
this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?

#11

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Noah Misch (#10)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch <noah@leadboat.com> wrote in

On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

could not link file “pg_wal/xlogtemp.18799" to

“pg_wal/000000010000D45300000010”: File exists

Hmm. It seems like a race condition betwen StartupXLOG() and
RemoveXlogFIle(). We need wider extent of ContolFileLock. Concretely
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this happening. (If this is correct
this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?

Ug.. No. I can't. I was confused by something. Sorry.

PreallocXlogFiles() and checkpointer are mutually excluded by the same
lock, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12

Noah Misch

noah@leadboat.com

over 3 years ago

In reply to: Noah Misch (#6)

1 attachment(s)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

Attachments:

XLogFileInit6-hint-v1.patchtext/plain; charset=us-asciiDownload

Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Add HINT for restartpoint race with KeepFileRestoredFromArchive().
    
    The five commits ending at cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
    closed this race condition for v15+.  For v14 through v10, add a HINT to
    discourage studying the cosmetic problem.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20220731061747.GA3692882@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cdd01f..9c845b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3480,7 +3480,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path)));
+				 errmsg("could not open file \"%s\": %m", path),
+				 (AmCheckpointerProcess() ?
+				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
+				  0)));
 
 	elog(DEBUG2, "done creating and filling new WAL file");
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 6688dee..e76daff 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -834,7 +834,10 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
+						oldfile, newfile),
+				 (AmCheckpointerProcess() ?
+				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
+				  0)));
 		return -1;
 	}
 	unlink(oldfile);
@@ -844,7 +847,10 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
+						oldfile, newfile),
+				 (AmCheckpointerProcess() ?
+				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
+				  0)));
 		return -1;
 	}
 #endif

#13

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Noah Misch (#12)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14

David Steele

david@pgmasters.net

over 3 years ago

In reply to: Kyotaro Horiguchi (#13)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

+1

Looks good to me as well.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal
growing without bound. Even if we are not sure what is causing it, how
confident are we that the patches applied to v15 would fix it?

Regards,
-David

#15

Noah Misch

noah@leadboat.com

over 3 years ago

In reply to: David Steele (#14)

Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

+1

Looks good to me as well.

Thanks for reviewing.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
without bound. Even if we are not sure what is causing it, how confident are
we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating. The next things I'd want
to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.