avoid multiple hard links to same WAL file after a crash

Started by Nathan Bossartalmost 4 years ago59 messages
#1Nathan Bossart
nathandbossart@gmail.com
1 attachment(s)

Hi hackers,

I am splitting this off of a previous thread aimed at reducing archiving
overhead [0]/messages/by-id/20220222011948.GA3850532@nathanxps13, as I believe this fix might deserve back-patching.

Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1]/messages/by-id/20220222173711.GA3852671@nathanxps13 demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling. Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting. This seems likely to lead to WAL corruption.

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.

This patch also sets the stage for reducing archiving overhead (as
discussed in the other thread [0]/messages/by-id/20220222011948.GA3850532@nathanxps13). The proposed change to reduce
archiving overhead will make it more likely that the server will attempt to
re-archive segments after a crash. This might lead to archive corruption
if the server concurrently writes to the same file via the aforementioned
bug.

[0]: /messages/by-id/20220222011948.GA3850532@nathanxps13
[1]: /messages/by-id/20220222173711.GA3852671@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Avoid-multiple-hard-links-to-same-WAL-file-after-.patchtext/x-diff; charset=us-asciiDownload
From 244726f6a78aca52c2fe6e70cef966f152057191 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Thu, 7 Apr 2022 10:07:42 -0700
Subject: [PATCH v1 1/1] Avoid multiple hard links to same WAL file after a
 crash.

Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at
an unfortunate moment can result in two links to the same file.  My testing
demonstrated that it was possible to end up with two links to the same file in
pg_wal after a crash just before unlink() during WAL recycling.  Specifically,
the test produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting.  This
seems likely to lead to WAL corruption.

This change prevents this problem by using durable_rename() instead of
durable_rename_excl() for WAL recycling.  This removes the protection against
accidentally overwriting an existing WAL file, but there shouldn't be one.

Back-patch to all supported versions.

Author: Nathan Bossart
---
 src/backend/access/transam/xlog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6770c3ddba..6ab5b2a622 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3324,13 +3324,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
+	 * Perform the rename.  Ideally, we'd use link() and unlink() to avoid
+	 * overwriting an existing file (there shouldn't be one).  However, that
+	 * approach opens up the possibility that pg_wal will contain multiple hard
+	 * links to the same WAL file after a crash.
 	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1

#2Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#1)
Re: avoid multiple hard links to same WAL file after a crash

On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1] demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling. Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting. This seems likely to lead to WAL corruption.

Wow, that's bad.

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#2)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at
all. Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.

I do agree that it's worth examining this stuff a bit closer. I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#3)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote:

On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

Ha, so there are only a few callers of durable_rename_excl() in the
PostgreSQL tree. One is basic_archive.c, which is already doing a stat()
check. IIRC I only used durable_rename_excl() here to handle the case
where multiple servers are writing archives to the same location. If that
happened, the archiver process would begin failing. If a crash left two
hard links to the same file around, we will silently succeed the next time
around thanks to the compare_files() check. Besides the WAL installation
code, the only other callers are in timeline.c, and both note that the use
of durable_rename_excl() is for "paranoidly trying to avoid overwriting an
existing file (there shouldn't be one)."

So AFAICT basic_archive.c is the only caller with a strong reason for using
durable_rename_excl(), and even that might not be worth keeping it around.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#2)
1 attachment(s)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

I've attached a patch that simply removes durable_rename_excl() and
replaces existing calls with durable_rename(). I noticed that Andres
expressed similar misgivings about durable_rename_excl() last year [0]https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de [1]/messages/by-id/20210318023004.gz2aejhze2kkkqr2@alap3.anarazel.de.
I can create a stat-then-durable-rename version of this for back-patching
if that is still the route we want to go.

[0]: https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de
[1]: /messages/by-id/20210318023004.gz2aejhze2kkkqr2@alap3.anarazel.de

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Remove-durable_rename_excl.patchtext/x-diff; charset=us-asciiDownload
From d3c633e19555dc0cf98207ad5e7c08ab9ce85dc0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Fri, 8 Apr 2022 11:48:17 -0700
Subject: [PATCH v2 1/1] Remove durable_rename_excl().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such ovewrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive uses it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change removes durable_rename_excl() and replaces all existing
calls with durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.

Author: Nathan Bossart
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/20220407182954.GA1231544%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 ++-
 src/backend/access/transam/timeline.c | 14 +-----
 src/backend/access/transam/xlog.c     |  8 +---
 src/backend/storage/file/fd.c         | 63 ---------------------------
 src/include/pg_config_manual.h        |  7 ---
 src/include/storage/fd.h              |  1 -
 6 files changed, 7 insertions(+), 91 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..128f754e87 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +514,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7814d4019..d19215ab24 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3323,14 +3323,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..88645ed83d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -807,69 +807,6 @@ durable_unlink(const char *fname, int elevel)
 	return 0;
 }
 
-/*
- * durable_rename_excl -- rename a file in a durable manner.
- *
- * Similar to durable_rename(), except that this routine tries (but does not
- * guarantee) not to overwrite the target file.
- *
- * Note that a crash in an unfortunate moment can leave you with two links to
- * the target file.
- *
- * Log errors with the caller specified severity.
- *
- * On Windows, using a hard link followed by unlink() causes concurrency
- * issues, while a simple rename() does not cause that, so be careful when
- * changing the logic of this routine.
- *
- * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
- * valid upon return.
- */
-int
-durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
-{
-	/*
-	 * Ensure that, if we crash directly after the rename/link, a file with
-	 * valid contents is moved into place.
-	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
-
-#ifdef HAVE_WORKING_LINK
-	if (link(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-	unlink(oldfile);
-#else
-	if (rename(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-#endif
-
-	/*
-	 * Make change persistent in case of an OS crash, both the new entry and
-	 * its parent directory need to be flushed.
-	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
-
-	/* Same for parent directory */
-	if (fsync_parent_path(newfile, elevel) != 0)
-		return -1;
-
-	return 0;
-}
-
 /*
  * InitFileAccess --- initialize this module during backend startup
  *
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 84ce5a4a5d..830804fdfb 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -163,13 +163,6 @@
 #define USE_BARRIER_SMGRRELEASE
 #endif
 
-/*
- * Define this if your operating system supports link()
- */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-#define HAVE_WORKING_LINK 1
-#endif
-
 /*
  * USE_POSIX_FADVISE controls whether Postgres will attempt to use the
  * posix_fadvise() kernel call.  Usually the automatic configure tests are
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..2b4a8e0ffe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,7 +187,6 @@ extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
-- 
2.25.1

#6Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#3)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at
all. Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().

Sure, but a guarantee that happens on only some systems isn't worth
much. And, if it comes at the price of potentially having multiple
hard links to the same file in obscure situations, that seems like it
could easily cause more problems than this whole scheme can ever hope
to solve.

I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

I wonder if this is really true. I thought rename() was supposed to be atomic.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#6)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:

I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

I wonder if this is really true. I thought rename() was supposed to be atomic.

Looks like it's atomic in that it's not like cp + rm, but not atomic the other
way you want.

| If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which
| both oldpath and newpath refer to the file being renamed.

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#1)
Re: avoid multiple hard links to same WAL file after a crash

At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.

From another direction, if the new segment was the currently active
one, we just mustn't install it. Otherwise we don't care.

So, the only thing we need to care is segment switch. Without it, the
segment that InstallXLogFileSegment found by the stat loop is known to
be safe to overwrite even if exists.

When segment switch finds an existing file, it's no problem since the
segment switch doesn't create a new segment. Otherwise segment switch
always calls InstallXLogFileSegment. The section from searching for
an empty segmetn slot until calling durable_rename_excl() is protected
by ControlFileLock. Thus if a process is in the section, no other
process can switch to a newly-created segment.

If this diagnosis is correct, the comment is proved to be paranoid.

* Perform the rename using link if available, paranoidly trying to avoid
* overwriting an existing file (there shouldn't be one).

As the result, I think Nathan's fix is correct that we can safely use
durable_rename() instead.

And I propose to use renameat2 on Linux so that we can detect the
contradicting case by the regression tests even though only on Linux.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#8)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

So, the only thing we need to care is segment switch. Without it, the
segment that InstallXLogFileSegment found by the stat loop is known to
be safe to overwrite even if exists.

When segment switch finds an existing file, it's no problem since the
segment switch doesn't create a new segment. Otherwise segment switch
always calls InstallXLogFileSegment. The section from searching for
an empty segmetn slot until calling durable_rename_excl() is protected
by ControlFileLock. Thus if a process is in the section, no other
process can switch to a newly-created segment.

If this diagnosis is correct, the comment is proved to be paranoid.

It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: avoid multiple hard links to same WAL file after a crash

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

If this diagnosis is correct, the comment is proved to be paranoid.

It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.

"git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)

regards, tom lane

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#10)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

If this diagnosis is correct, the comment is proved to be paranoid.

It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.

"git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)

I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#11)
Re: avoid multiple hard links to same WAL file after a crash

At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

If this diagnosis is correct, the comment is proved to be paranoid.

It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.

"git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)

I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.

f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
somwhere out of the ML.. This patch changed XLogFileInit to
supportusing existent files so that XLogWrite can use the new segment
provided by checkpoint and still allow XLogWrite to create a new
segment by itself.

Just before the commit, calls to XLogFileInit were protected (or
serialized) by logwr_lck. At the commit calls to the same function
were still serialized by ControlFileLockId.

I *guess* that Vadim faced/noticed a race condition when he added
checkpoint. Thus introduced the link+remove protocol but finally it
became useless by moving the call to XLogFileInit within
ControlFileLockId section. But, of course, all of story is mere a
guess.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: avoid multiple hard links to same WAL file after a crash

On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:

At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.

f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
somwhere out of the ML.. This patch changed XLogFileInit to
supportusing existent files so that XLogWrite can use the new segment
provided by checkpoint and still allow XLogWrite to create a new
segment by itself.

Yeah, I've been unable to find any discussion besides a brief reference to
adding checkpointing [0]/messages/by-id/8F4C99C66D04D4118F580090272A7A23018D85@sectorbase1.sectorbase.com.

[0]: /messages/by-id/8F4C99C66D04D4118F580090272A7A23018D85@sectorbase1.sectorbase.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#14Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#6)
Re: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:

On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

I wonder if this is really true. I thought rename() was supposed to be atomic.

Not always. For example, some old versions of MacOS have a non-atomic
implementation of rename(), like prairiedog with 10.4. Even 10.5 does
not handle atomicity as far as I call. In short, it looks like a bad
idea to me to rely on this idea at all. Some FSes have their own way
of handling things, as well, but I am not much into this world.

Saying that, it would be nice to see durable_rename_excl() gone as it
has created quite a bit of pain for us in the past years.
--
Michael

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#14)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote:

Saying that, it would be nice to see durable_rename_excl() gone as it
has created quite a bit of pain for us in the past years.

Yeah, I think this is the right thing to do. Patch upthread [0]/messages/by-id/20220408194345.GA1541826@nathanxps13.

For back-branches, I suspect we'll want to remove all uses of
durable_rename_excl() but leave the function around for any extensions that
are using it. Of course, we'd also need a big comment imploring folks not
to add any more callers. Another option would be to change the behavior of
durable_rename_excl() to something that we think is safer (e.g., stat then
rename), but that might just introduce a different set of problems.

[0]: /messages/by-id/20220408194345.GA1541826@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#14)
Re: avoid multiple hard links to same WAL file after a crash

Michael Paquier <michael@paquier.xyz> writes:

On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:

I wonder if this is really true. I thought rename() was supposed to be atomic.

Not always. For example, some old versions of MacOS have a non-atomic
implementation of rename(), like prairiedog with 10.4. Even 10.5 does
not handle atomicity as far as I call.

I think that's not talking about the same thing. POSIX requires rename(2)
to replace an existing target link atomically:

If the link named by the new argument exists, it shall be removed and
old renamed to new. In this case, a link named new shall remain
visible to other threads throughout the renaming operation and refer
either to the file referred to by new or old before the operation
began.

(It's that requirement that ancient macOS fails to meet.)

However, I do not see any text that addresses the question of whether
the old link disappears atomically with the appearance of the new link,
and it seems like that'd be pretty impractical to ensure in cases like
moving a link from one directory to another. (What would it even mean
to say that, considering that a thread can't read the two directories
at the same instant?) From a crash-safety standpoint, it'd surely be
better to make the new link before removing the old, so I imagine
that's what most file systems do.

regards, tom lane

#17Greg Stark
stark@mit.edu
In reply to: Tom Lane (#16)
Re: avoid multiple hard links to same WAL file after a crash

The readdir interface allows processes to be in the middle of reading
a directory and unless a kernel was happy to either materialize the
entire directory list when the readdir starts, or lock the entire
directory against modification for the entire time the a process has a
readdir fd open it's always going to be possible for the a process to
have previously read the old directory entry and later see the new
directory entry. Kernels don't do any MVCC or cmin type of games so
they're not going to be able to prevent it.

What's worse of course is that it may only happen in very large
directories. Most directories fit on a single block and readdir may
buffer up all the entries a block at a time for efficiency. So it may
only be visible on very large directories that span multiple blocks.

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Stark (#17)
2 attachment(s)
Re: avoid multiple hard links to same WAL file after a crash

Here is an attempt at creating something that can be back-patched. 0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched. 0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching. I imagine
0002 will need to be held back for v16devel.

I think back-patching 0001 will encounter a couple of small obstacles. For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13. I don't mind producing a patch for each back-branch if needed.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Replace-calls-to-durable_rename_excl-with-durable.patchtext/x-diff; charset=us-asciiDownload
From d489c2bff029db6e07e5028788faf869c35f886b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 11:56:50 -0700
Subject: [PATCH v3 1/2] Replace calls to durable_rename_excl() with
 durable_rename().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change replaces all calls to durable_rename_excl() with
durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.
The function itself is left around in case any extensions are using
it.  It will be removed in v16 via a follow-up commit.

Back-patch to all supported versions.  Before v13,
durable_rename_excl() was named durable_link_or_rename().

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 +++--
 src/backend/access/transam/timeline.c | 14 ++------------
 src/backend/access/transam/xlog.c     |  8 ++------
 3 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..128f754e87 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +514,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61cda56c6f..f49194a8b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3323,14 +3323,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1

v3-0002-Remove-durable_rename_excl.patchtext/x-diff; charset=us-asciiDownload
From 398350968f35f0974f1668e06be1adad4a7f7e3c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 12:38:23 -0700
Subject: [PATCH v3 2/2] Remove durable_rename_excl().

A previous commit replaced all calls to this function with
durable_rename(), but the function itself was not removed in back-
branches since extensions may use it.  This change removes the
function from v16devel.

Do not back-patch.

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 src/backend/storage/file/fd.c  | 63 ----------------------------------
 src/include/pg_config_manual.h |  7 ----
 src/include/storage/fd.h       |  1 -
 3 files changed, 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..f904f60c08 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -807,69 +807,6 @@ durable_unlink(const char *fname, int elevel)
 	return 0;
 }
 
-/*
- * durable_rename_excl -- rename a file in a durable manner.
- *
- * Similar to durable_rename(), except that this routine tries (but does not
- * guarantee) not to overwrite the target file.
- *
- * Note that a crash in an unfortunate moment can leave you with two links to
- * the target file.
- *
- * Log errors with the caller specified severity.
- *
- * On Windows, using a hard link followed by unlink() causes concurrency
- * issues, while a simple rename() does not cause that, so be careful when
- * changing the logic of this routine.
- *
- * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
- * valid upon return.
- */
-int
-durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
-{
-	/*
-	 * Ensure that, if we crash directly after the rename/link, a file with
-	 * valid contents is moved into place.
-	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
-
-#ifdef HAVE_WORKING_LINK
-	if (link(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-	unlink(oldfile);
-#else
-	if (rename(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-#endif
-
-	/*
-	 * Make change persistent in case of an OS crash, both the new entry and
-	 * its parent directory need to be flushed.
-	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
-
-	/* Same for parent directory */
-	if (fsync_parent_path(newfile, elevel) != 0)
-		return -1;
-
-	return 0;
-}
-
 /*
  * InitFileAccess --- initialize this module during backend startup
  *
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 84ce5a4a5d..830804fdfb 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -163,13 +163,6 @@
 #define USE_BARRIER_SMGRRELEASE
 #endif
 
-/*
- * Define this if your operating system supports link()
- */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-#define HAVE_WORKING_LINK 1
-#endif
-
 /*
  * USE_POSIX_FADVISE controls whether Postgres will attempt to use the
  * posix_fadvise() kernel call.  Usually the automatic configure tests are
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..2b4a8e0ffe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,7 +187,6 @@ extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
-- 
2.25.1

#19Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#18)
Re: avoid multiple hard links to same WAL file after a crash

On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote:

Here is an attempt at creating something that can be back-patched. 0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched. 0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching. I imagine
0002 will need to be held back for v16devel.

I would not mind applying 0002 on HEAD now to avoid more uses of this
API, and I can get behind 0001 after thinking more about it.

I think back-patching 0001 will encounter a couple of small obstacles. For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13. I don't mind producing a patch for each back-branch if needed.

I am not sure that have any need to backpatch this change based on the
unlikeliness of the problem, TBH. One thing that is itching me a bit,
like Robert upthread, is that we don't check anymore that the newfile
does not exist in the code paths because we never expect one. It is
possible to use stat() for that. But access() within a simple
assertion would be simpler? Say something like:
Assert(access(path, F_OK) != 0 && errno == ENOENT);

The case for basic_archive is limited as the comment of the patch
states, but that would be helpful for the two calls in timeline.c and
the one in xlog.c in the long-term. And this has no need to be part
of fd.c, this can be added before the durable_rename() calls. What do
you think?
--
Michael

#20Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#19)
2 attachment(s)
Re: avoid multiple hard links to same WAL file after a crash

On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote:

I am not sure that have any need to backpatch this change based on the
unlikeliness of the problem, TBH. One thing that is itching me a bit,
like Robert upthread, is that we don't check anymore that the newfile
does not exist in the code paths because we never expect one. It is
possible to use stat() for that. But access() within a simple
assertion would be simpler? Say something like:
Assert(access(path, F_OK) != 0 && errno == ENOENT);

The case for basic_archive is limited as the comment of the patch
states, but that would be helpful for the two calls in timeline.c and
the one in xlog.c in the long-term. And this has no need to be part
of fd.c, this can be added before the durable_rename() calls. What do
you think?

Here is a new patch set with these assertions added. I think at least the
xlog.c change ought to be back-patched. The problem may be unlikely, but
AFAICT the possible consequences include WAL corruption.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Replace-calls-to-durable_rename_excl-with-durable.patchtext/x-diff; charset=us-asciiDownload
From 8ffc337621f8a287350a7a55256b58b0585f7a1f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 11:56:50 -0700
Subject: [PATCH v4 1/2] Replace calls to durable_rename_excl() with
 durable_rename().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change replaces all calls to durable_rename_excl() with
durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.
The function itself is left around in case any extensions are using
it.  It will be removed in v16 via a follow-up commit.

Back-patch to all supported versions.  Before v13,
durable_rename_excl() was named durable_link_or_rename().

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 +++--
 src/backend/access/transam/timeline.c | 16 ++++------------
 src/backend/access/transam/xlog.c     |  9 +++------
 3 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..f3a8e53aa4 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +515,8 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61cda56c6f..76ec80c950 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3323,14 +3323,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1

v4-0002-Remove-durable_rename_excl.patchtext/x-diff; charset=us-asciiDownload
From 1d8e3c2040d1acf04e8c14637debf0dc12d500d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 12:38:23 -0700
Subject: [PATCH v4 2/2] Remove durable_rename_excl().

A previous commit replaced all calls to this function with
durable_rename(), but the function itself was not removed in back-
branches since extensions may use it.  This change removes the
function from v16devel.

Do not back-patch.

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 src/backend/storage/file/fd.c  | 63 ----------------------------------
 src/include/pg_config_manual.h |  7 ----
 src/include/storage/fd.h       |  1 -
 3 files changed, 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..f904f60c08 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -807,69 +807,6 @@ durable_unlink(const char *fname, int elevel)
 	return 0;
 }
 
-/*
- * durable_rename_excl -- rename a file in a durable manner.
- *
- * Similar to durable_rename(), except that this routine tries (but does not
- * guarantee) not to overwrite the target file.
- *
- * Note that a crash in an unfortunate moment can leave you with two links to
- * the target file.
- *
- * Log errors with the caller specified severity.
- *
- * On Windows, using a hard link followed by unlink() causes concurrency
- * issues, while a simple rename() does not cause that, so be careful when
- * changing the logic of this routine.
- *
- * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
- * valid upon return.
- */
-int
-durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
-{
-	/*
-	 * Ensure that, if we crash directly after the rename/link, a file with
-	 * valid contents is moved into place.
-	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
-
-#ifdef HAVE_WORKING_LINK
-	if (link(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-	unlink(oldfile);
-#else
-	if (rename(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-#endif
-
-	/*
-	 * Make change persistent in case of an OS crash, both the new entry and
-	 * its parent directory need to be flushed.
-	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
-
-	/* Same for parent directory */
-	if (fsync_parent_path(newfile, elevel) != 0)
-		return -1;
-
-	return 0;
-}
-
 /*
  * InitFileAccess --- initialize this module during backend startup
  *
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 84ce5a4a5d..830804fdfb 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -163,13 +163,6 @@
 #define USE_BARRIER_SMGRRELEASE
 #endif
 
-/*
- * Define this if your operating system supports link()
- */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-#define HAVE_WORKING_LINK 1
-#endif
-
 /*
  * USE_POSIX_FADVISE controls whether Postgres will attempt to use the
  * posix_fadvise() kernel call.  Usually the automatic configure tests are
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..2b4a8e0ffe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,7 +187,6 @@ extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
-- 
2.25.1

#21Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#20)
Re: avoid multiple hard links to same WAL file after a crash

On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote:

Here is a new patch set with these assertions added. I think at least the
xlog.c change ought to be back-patched. The problem may be unlikely, but
AFAICT the possible consequences include WAL corruption.

Okay, so I have applied this stuff this morning to see what the
buildfarm had to say, and we have finished with a set of failures in
various buildfarm members:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&amp;dt=2022-04-28%2002%3A13%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&amp;dt=2022-04-28%2002%3A14%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&amp;dt=2022-04-28%2002%3A59%3A26

All of them did not like the part where we assume that a TLI history
file written by a WAL receiver should not exist beforehand, but as
025_stuck_on_old_timeline.pl is showing, a standby may attempt to
retrieve a TLI history file after getting it from the archives.

I was analyzing the whole thing, and it looks like a race condition.
Per the the buildfarm logs, we have less than 5ms between the moment
the startup process retrieves the history file of TLI 2 from the
archives and the moment the WAL receiver decides to check if this TLI
file exists. If it does not exist, it would then retrieve it from the
primary via streaming. So I guess that the sequence of events is
that:
- In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the
existence of the history file for TLI 2, does not find it.
- The startup process retrieves the file from the archives.
- The WAL receiver goes through the internal loop of
WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the
primary's stream.

Switching from durable_rename_excl() to durable_rename() would mean
that we'd overwrite the TLI file received from the primary stream over
what's been retrieved from the archives. That does not strike me as
an issue in itself and that should be safe, so the comment is
misleading, and we can live without the assertion in
writeTimeLineHistoryFile() called by the WAL receiver. Now, I think
that we'd better keep some belts in writeTimeLineHistory() called by
the startup process at the end-of-recovery as I should never ever have
a TLI file generated when selecting a new timeline. Perhaps this
should be a elog(ERROR) at least, with a check on the file existence
before calling durable_rename()?

Anyway, my time is constrained next week due to the upcoming Japanese
Golden Week and the buildfarm has to be stable, so I have reverted the
change for now.
--
Michael

#22Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#13)
Re: avoid multiple hard links to same WAL file after a crash

On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote:

On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:

At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.

f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
somwhere out of the ML.. This patch changed XLogFileInit to
supportusing existent files so that XLogWrite can use the new segment
provided by checkpoint and still allow XLogWrite to create a new
segment by itself.

Yes, I think that you are right here. I also suspect that the
checkpoint command was facing a concurrency issue while working on
the feature and that Vadim saw that this part of the implementation
would be safer in the long run if we use link() followed by unlink().

Yeah, I've been unable to find any discussion besides a brief reference to
adding checkpointing [0].

[0] /messages/by-id/8F4C99C66D04D4118F580090272A7A23018D85@sectorbase1.sectorbase.com

While looking at the history of this area, I have also noticed this
argument, telling also that this is a safety measure if this code were
to run in parallel, but that's without counting on the control file
lock hold while doing this operation anyway:
/messages/by-id/24974.982597735@sss.pgh.pa.us

As mentioned already upthread, f0e37a8 is the origin of the
link()/unlink() business in the WAL segment initialization logic, and
also note 1f159e5 that has added a rename() as extra code path for
systems where link() was not working.

At the end, switching directly from durable_rename_excl() to
durable_rename() should be fine for the WAL segment initialization,
but we could do things a bit more carefully by adding a check on the
file existence before calling durable_rename() and issue a elog(LOG)
if a file is found, giving a mean for the WAL recycling to give up
peacefully as it does now. Per my analysis, the TLI history file
created at the end of recovery ought to issue an elog(ERROR).

Now, I am surprised by the third code path of durable_rename_excl(),
as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
any issues, as link() should exit with EEXIST when the startup process
grabs the same history file concurrently. It seems to me that in this
last case using durable_rename() could be an improvement and prevent
extra WAL receiver restarts as a TLI history fetched from the primary
via streaming or from some archives should be the same, but we could
be more careful, like the WAL init logic, by skipping the
durable_rename() and issuing an elog(LOG). That would not be perfect,
still a bit better than the current state of HEAD.

As we are getting closer to the beta release, it looks safer to let
this change aside a bit longer and wait for v16 to be opened for
business on HEAD.
--
Michael

#23Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#22)
Re: avoid multiple hard links to same WAL file after a crash

On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:

Now, I am surprised by the third code path of durable_rename_excl(),
as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
any issues, as link() should exit with EEXIST when the startup process
grabs the same history file concurrently. It seems to me that in this
last case using durable_rename() could be an improvement and prevent
extra WAL receiver restarts as a TLI history fetched from the primary
via streaming or from some archives should be the same, but we could
be more careful, like the WAL init logic, by skipping the
durable_rename() and issuing an elog(LOG). That would not be perfect,
still a bit better than the current state of HEAD.

Skimming through at the buildfarm logs, it happens that the tests are
able to see this race from time to time. Here is one such example on
rorqual:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&amp;dt=2022-04-20%2004%3A47%3A58&amp;stg=recovery-check

And here are the relevant logs:
2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log
file "00000002.history" from archive
2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching
timeline history file for timeline 2 from primary server
2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could
not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history":
File exists
[...]
2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started
streaming WAL from primary at 0/3000000 on timeline 2

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after. Using durable_rename() would not be an issue
here.
--
Michael

#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#22)
Re: avoid multiple hard links to same WAL file after a crash

On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:

At the end, switching directly from durable_rename_excl() to
durable_rename() should be fine for the WAL segment initialization,
but we could do things a bit more carefully by adding a check on the
file existence before calling durable_rename() and issue a elog(LOG)
if a file is found, giving a mean for the WAL recycling to give up
peacefully as it does now. Per my analysis, the TLI history file
created at the end of recovery ought to issue an elog(ERROR).

My only concern with this approach is that it inevitably introduces a race
condition. In most cases, the file existence check will prevent
overwrites, but it might not always. Furthermore, we believe that such
overwrites either 1) should not happen (e.g., WAL recycling) or 2) won't
cause problems if they happen (e.g., when the WAL receiver writes the TLI
history file). Also, these races will be difficult to test, so we won't
know what breaks when they occur.

My instinct is to just let the overwrites happen. That way, we are more
likely to catch breakage in tests, and we'll have one less race condition
to worry about. I don't mind asserting that the file doesn't exist when we
don't expect it to, as that might help catch potential problems in
development without affecting behavior in production. If we do want to
add file existence checks, I think we'd better add a comment about the
potential for race conditions.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#23)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote:

Skimming through at the buildfarm logs, it happens that the tests are
able to see this race from time to time. Here is one such example on
rorqual:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&amp;dt=2022-04-20%2004%3A47%3A58&amp;stg=recovery-check

And here are the relevant logs:
2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log
file "00000002.history" from archive
2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching
timeline history file for timeline 2 from primary server
2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could
not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history":
File exists
[...]
2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started
streaming WAL from primary at 0/3000000 on timeline 2

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after. Using durable_rename() would not be an issue
here.

Thanks for investigating this one. I think I agree that we should simply
switch to durable_rename() (without a file existence check beforehand).

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#26Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#25)
2 attachment(s)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, May 02, 2022 at 10:39:07AM -0700, Nathan Bossart wrote:

On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote:

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after. Using durable_rename() would not be an issue
here.

Thanks for investigating this one. I think I agree that we should simply
switch to durable_rename() (without a file existence check beforehand).

Here is a new patch set. For now, I've only removed the file existence
check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced
that there isn't a problem here (e.g., due to concurrent .ready file
creation), but since some platforms have been using rename() for some time,
I don't know how worried we should be. I thought about adding some kind of
locking between the WAL receiver and startup processes, but that seems
excessive. Alternatively, we could just fix xlog.c as proposed earlier
[0]: /messages/by-id/20220407182954.GA1231544@nathanxps13
the multiple-hard-link issue. All other callers are simply renaming a
temporary file into place, and the temporary file can be discarded if left
behind after a crash.

[0]: /messages/by-id/20220407182954.GA1231544@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Replace-calls-to-durable_rename_excl-with-durable.patchtext/x-diff; charset=us-asciiDownload
From 312860b4dc3794428c1e31528c05cf4b1ad06f00 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 11:56:50 -0700
Subject: [PATCH v5 1/2] Replace calls to durable_rename_excl() with
 durable_rename().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change replaces all calls to durable_rename_excl() with
durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.
The function itself is left around in case any extensions are using
it.  It will be removed in v16 via a follow-up commit.

Back-patch to all supported versions.  Before v13,
durable_rename_excl() was named durable_link_or_rename().

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 +++--
 src/backend/access/transam/timeline.c | 18 +++++-------------
 src/backend/access/transam/xlog.c     |  9 +++------
 3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..e0a2a8ea68 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the completed history file into place with its final name.
+	 * Now move the completed history file into place with its final name,
+	 * replacing any existing file with the same name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61cda56c6f..76ec80c950 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3323,14 +3323,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1

v5-0002-Remove-durable_rename_excl.patchtext/x-diff; charset=us-asciiDownload
From b239f1be33ea0e54b8e661312df52af49c090882 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Tue, 26 Apr 2022 12:38:23 -0700
Subject: [PATCH v5 2/2] Remove durable_rename_excl().

A previous commit replaced all calls to this function with
durable_rename(), but the function itself was not removed in back-
branches since extensions may use it.  This change removes the
function from v16devel.

Do not back-patch.

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 src/backend/storage/file/fd.c  | 63 ----------------------------------
 src/include/pg_config_manual.h |  7 ----
 src/include/storage/fd.h       |  1 -
 3 files changed, 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..f904f60c08 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -807,69 +807,6 @@ durable_unlink(const char *fname, int elevel)
 	return 0;
 }
 
-/*
- * durable_rename_excl -- rename a file in a durable manner.
- *
- * Similar to durable_rename(), except that this routine tries (but does not
- * guarantee) not to overwrite the target file.
- *
- * Note that a crash in an unfortunate moment can leave you with two links to
- * the target file.
- *
- * Log errors with the caller specified severity.
- *
- * On Windows, using a hard link followed by unlink() causes concurrency
- * issues, while a simple rename() does not cause that, so be careful when
- * changing the logic of this routine.
- *
- * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
- * valid upon return.
- */
-int
-durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
-{
-	/*
-	 * Ensure that, if we crash directly after the rename/link, a file with
-	 * valid contents is moved into place.
-	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
-
-#ifdef HAVE_WORKING_LINK
-	if (link(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-	unlink(oldfile);
-#else
-	if (rename(oldfile, newfile) < 0)
-	{
-		ereport(elevel,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile)));
-		return -1;
-	}
-#endif
-
-	/*
-	 * Make change persistent in case of an OS crash, both the new entry and
-	 * its parent directory need to be flushed.
-	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
-
-	/* Same for parent directory */
-	if (fsync_parent_path(newfile, elevel) != 0)
-		return -1;
-
-	return 0;
-}
-
 /*
  * InitFileAccess --- initialize this module during backend startup
  *
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 84ce5a4a5d..830804fdfb 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -163,13 +163,6 @@
 #define USE_BARRIER_SMGRRELEASE
 #endif
 
-/*
- * Define this if your operating system supports link()
- */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-#define HAVE_WORKING_LINK 1
-#endif
-
 /*
  * USE_POSIX_FADVISE controls whether Postgres will attempt to use the
  * posix_fadvise() kernel call.  Usually the automatic configure tests are
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..2b4a8e0ffe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,7 +187,6 @@ extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
-- 
2.25.1

#27Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#26)
Re: avoid multiple hard links to same WAL file after a crash

On Mon, May 02, 2022 at 04:06:13PM -0700, Nathan Bossart wrote:

Here is a new patch set. For now, I've only removed the file existence
check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced
that there isn't a problem here (e.g., due to concurrent .ready file
creation), but since some platforms have been using rename() for some time,
I don't know how worried we should be.

That's only about Windows these days, meaning that there is much less
coverage in this code path.

I thought about adding some kind of
locking between the WAL receiver and startup processes, but that seems
excessive.

Agreed.

Alternatively, we could just fix xlog.c as proposed earlier
[0]. AFAICT that is the only caller that can experience problems due to
the multiple-hard-link issue. All other callers are simply renaming a
temporary file into place, and the temporary file can be discarded if left
behind after a crash.

I'd agree with removing all the callers at the end. pgrename() is
quite robust on Windows, but I'd keep the two checks in
writeTimeLineHistory(), as the logic around findNewestTimeLine() would
consider a past TLI history file as in-use even if we have a crash
just after the file got created in the same path by the same standby,
and the WAL segment init part. Your patch does that.
--
Michael

#28Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#27)
Re: avoid multiple hard links to same WAL file after a crash

On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:

I'd agree with removing all the callers at the end. pgrename() is
quite robust on Windows, but I'd keep the two checks in
writeTimeLineHistory(), as the logic around findNewestTimeLine() would
consider a past TLI history file as in-use even if we have a crash
just after the file got created in the same path by the same standby,
and the WAL segment init part. Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file). The
commit log includes details about the reasoning of all the areas
changed, for clarity, as of the WAL recycling part, the TLI history
file part and basic_archive.
--
Michael

#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#28)
Re: avoid multiple hard links to same WAL file after a crash

On Tue, Jul 05, 2022 at 10:19:49AM +0900, Michael Paquier wrote:

On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:

I'd agree with removing all the callers at the end. pgrename() is
quite robust on Windows, but I'd keep the two checks in
writeTimeLineHistory(), as the logic around findNewestTimeLine() would
consider a past TLI history file as in-use even if we have a crash
just after the file got created in the same path by the same standby,
and the WAL segment init part. Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file). The
commit log includes details about the reasoning of all the areas
changed, for clarity, as of the WAL recycling part, the TLI history
file part and basic_archive.

Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile()
about possible concurrent use by a WAL receiver and the startup process and
why that is okay.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#30Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#29)
Re: avoid multiple hard links to same WAL file after a crash

On Tue, Jul 05, 2022 at 09:58:38AM -0700, Nathan Bossart wrote:

Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile()
about possible concurrent use by a WAL receiver and the startup process and
why that is okay.

Agreed. Adding an extra note at the top of the routine would help in
the future.
--
Michael

#31Robert Pang
robertpang@google.com
In reply to: Michael Paquier (#28)
Back-patch of: avoid multiple hard links to same WAL file after a crash

Dear team,

We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.

We found this fix [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3 that has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and the
potential for irrecoverability exists (even if rare), we would like to
discuss the possibility of back-patching this fix.

Are there any technical reasons not to back-patch this fix to older
major versions?

Thank you for your consideration.

Sincerely,
Robert Pang

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3

Show quoted text

On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:

I'd agree with removing all the callers at the end. pgrename() is
quite robust on Windows, but I'd keep the two checks in
writeTimeLineHistory(), as the logic around findNewestTimeLine() would
consider a past TLI history file as in-use even if we have a crash
just after the file got created in the same path by the same standby,
and the WAL segment init part. Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file). The
commit log includes details about the reasoning of all the areas
changed, for clarity, as of the WAL recycling part, the TLI history
file part and basic_archive.
--
Michael

#32Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Pang (#31)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:

We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.

Yikes!

We found this fix [1] that has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and the
potential for irrecoverability exists (even if rare), we would like to
discuss the possibility of back-patching this fix.

IMHO this is a good time to reevaluate. It looks like we originally didn't
back-patch out of an abundance of caution, but now that this one has had
time to bake, I think it's worth seriously considering, especially now that
we have a report from the field.

--
nathan

#33Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#32)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

Hi,

On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:

On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:

We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.

Yikes!

Indeed. As chance would have it, I was asked for input on a corrupted server
*today*. Eventually we found that recovery stopped early, after encountering a
segment with a *newer* pageaddr than we expected. Which made me think of this
issue, and indeed, the file recovery stopped at had two links. Before that
the server had been crashing on a regular basis for unrelated reasons, which
presumably increased the chances sufficiently to eventually hit this problem.

It's a normal thing to discover the end of the WAL by finding a segment that
has an older pageaddr than its name suggests. But in this case we saw a newer
page address. I wonder if we should treat that differently...

We found this fix [1] that has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and the
potential for irrecoverability exists (even if rare), we would like to
discuss the possibility of back-patching this fix.

IMHO this is a good time to reevaluate. It looks like we originally didn't
back-patch out of an abundance of caution, but now that this one has had
time to bake, I think it's worth seriously considering, especially now that
we have a report from the field.

Strongly agreed.

I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message. The crash has indeed to happen between the link() and
unlink() - but at the end of a checkpoint we do that operations hundreds of
times in a row on a busy server. And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quots.

Greetings,

Andres Freund

#34Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#33)
3 attachment(s)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote:

I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message. The crash has indeed to happen between the link() and
unlink() - but at the end of a checkpoint we do that operations hundreds of
times in a row on a busy server. And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quots.

Looks so, yep. Your timing and the report's timing are interesting.

I've been double-checking the code to refresh myself with the problem,
and I don't see a reason to not apply something like the attached set
down to v13 for all these remaining branches (minus an edit of the
commit message).

Thoughts?
--
Michael

Attachments:

0001-Replace-durable_rename_excl-by-durable_rename-ta-v15.patchtext/x-diff; charset=us-asciiDownload
From 168c7071541d231fc13aee424709524f2ed4976e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 5 Jul 2022 10:16:12 +0900
Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two

durable_rename_excl() attempts to avoid overwriting any existing files
by using link() and unlink(), and it falls back to rename() on some
platforms (aka WIN32), which offers no such overwrite protection.  Most
callers use durable_rename_excl() just in case there is an existing
file, but in practice there shouldn't be one (see below for more
details).

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  As per Nathan's tests, it is
possible to end up with two links to the same file in pg_wal after a
crash just before unlink() during WAL recycling.  Specifically, the test
produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting,
leading to WAL corruption.

This change replaces all the calls of durable_rename_excl() to
durable_rename().  This removes the protection against accidentally
overwriting an existing file, but some platforms are already living
without it and ordinarily there shouldn't be one.  The function itself
is left around in case any extensions are using it.  It will be removed
on HEAD via a follow-up commit.

Here is a summary of the existing callers of durable_rename_excl() (see
second discussion link at the bottom), replaced by this commit.  First,
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.  Second, xlog.c uses it to recycle
past WAL segments, where an overwrite should not happen (origin of the
change at f0e37a8) because there are protections about the WAL segment
to select when recycling an entry.  The third and last area is related
to the write of timeline history files.  writeTimeLineHistory() will
write a new timeline history file at the end of recovery on promotion,
so there should be no such files for the same timeline.
What remains is writeTimeLineHistoryFile(), that can be used in parallel
by a WAL receiver and the startup process, and some digging of the
buildfarm shows that EEXIST from a WAL receiver can happen with an error
of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\",
which would cause an automatic restart of the WAL receiver as it is
promoted to FATAL, hence this should improve the stability of the WAL
receiver as rename() would overwrite an existing TLI history file
already fetched by the startup process at recovery.

This is a bug fix, but knowing the unlikeliness of the problem involving
one or more crashes at an exceptionally bad moment, no backpatch is
done.  Also, I want to be careful with such changes (aaa3aed did the
opposite of this change by removing HAVE_WORKING_LINK so as Windows
would do a link() rather than a rename() but this was not
concurrent-safe).  A backpatch could be revisited in the future.  This
is the second time this change is attempted, ccfbd92 being the first
one, but this time no assertions are added for the case of a TLI history
file written concurrently by the WAL receiver or the startup process
because we can expect one to exist (some of the TAP tests are able to
trigger with a proper timing).

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
Discussion: https://postgr.es/m/Ym6GZbqQdlalSKSG@paquier.xyz
---
 src/backend/access/transam/timeline.c | 18 +++++-------------
 src/backend/access/transam/xlog.c     |  9 +++------
 contrib/basic_archive/basic_archive.c |  5 +++--
 3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..e0a2a8ea68 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the completed history file into place with its final name.
+	 * Now move the completed history file into place with its final name,
+	 * replacing any existing file with the same name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d8322fbd0..818284759d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3326,14 +3326,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 87dd77cdd3..db5fd87429 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -282,9 +282,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
-- 
2.45.2

0001-Replace-durable_rename_excl-by-durable_rename-ta-v14.patchtext/x-diff; charset=us-asciiDownload
From 51649664dd7549ea2f7cadc26d024a3ea56d684a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 5 Jul 2022 10:16:12 +0900
Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two

durable_rename_excl() attempts to avoid overwriting any existing files
by using link() and unlink(), and it falls back to rename() on some
platforms (aka WIN32), which offers no such overwrite protection.  Most
callers use durable_rename_excl() just in case there is an existing
file, but in practice there shouldn't be one (see below for more
details).

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  As per Nathan's tests, it is
possible to end up with two links to the same file in pg_wal after a
crash just before unlink() during WAL recycling.  Specifically, the test
produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting,
leading to WAL corruption.

This change replaces all the calls of durable_rename_excl() to
durable_rename().  This removes the protection against accidentally
overwriting an existing file, but some platforms are already living
without it and ordinarily there shouldn't be one.  The function itself
is left around in case any extensions are using it.  It will be removed
on HEAD via a follow-up commit.

Here is a summary of the existing callers of durable_rename_excl() (see
second discussion link at the bottom), replaced by this commit.  First,
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.  Second, xlog.c uses it to recycle
past WAL segments, where an overwrite should not happen (origin of the
change at f0e37a8) because there are protections about the WAL segment
to select when recycling an entry.  The third and last area is related
to the write of timeline history files.  writeTimeLineHistory() will
write a new timeline history file at the end of recovery on promotion,
so there should be no such files for the same timeline.
What remains is writeTimeLineHistoryFile(), that can be used in parallel
by a WAL receiver and the startup process, and some digging of the
buildfarm shows that EEXIST from a WAL receiver can happen with an error
of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\",
which would cause an automatic restart of the WAL receiver as it is
promoted to FATAL, hence this should improve the stability of the WAL
receiver as rename() would overwrite an existing TLI history file
already fetched by the startup process at recovery.

This is a bug fix, but knowing the unlikeliness of the problem involving
one or more crashes at an exceptionally bad moment, no backpatch is
done.  Also, I want to be careful with such changes (aaa3aed did the
opposite of this change by removing HAVE_WORKING_LINK so as Windows
would do a link() rather than a rename() but this was not
concurrent-safe).  A backpatch could be revisited in the future.  This
is the second time this change is attempted, ccfbd92 being the first
one, but this time no assertions are added for the case of a TLI history
file written concurrently by the WAL receiver or the startup process
because we can expect one to exist (some of the TAP tests are able to
trigger with a proper timing).

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
Discussion: https://postgr.es/m/Ym6GZbqQdlalSKSG@paquier.xyz
---
 src/backend/access/transam/timeline.c | 18 +++++-------------
 src/backend/access/transam/xlog.c     |  9 +++------
 2 files changed, 8 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..eeb505fb6a 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the completed history file into place with its final name.
+	 * Now move the completed history file into place with its final name,
+	 * replacing any existing file with the same name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f1a795bba9..a6e2cb88f3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3693,15 +3693,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.45.2

0001-Replace-durable_rename_excl-by-durable_rename-ta-v13.patchtext/x-diff; charset=us-asciiDownload
From 93c566230e5686da1b1d40186f31b79339be2bcf Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 5 Jul 2022 10:16:12 +0900
Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two

durable_rename_excl() attempts to avoid overwriting any existing files
by using link() and unlink(), and it falls back to rename() on some
platforms (aka WIN32), which offers no such overwrite protection.  Most
callers use durable_rename_excl() just in case there is an existing
file, but in practice there shouldn't be one (see below for more
details).

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  As per Nathan's tests, it is
possible to end up with two links to the same file in pg_wal after a
crash just before unlink() during WAL recycling.  Specifically, the test
produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting,
leading to WAL corruption.

This change replaces all the calls of durable_rename_excl() to
durable_rename().  This removes the protection against accidentally
overwriting an existing file, but some platforms are already living
without it and ordinarily there shouldn't be one.  The function itself
is left around in case any extensions are using it.  It will be removed
on HEAD via a follow-up commit.

Here is a summary of the existing callers of durable_rename_excl() (see
second discussion link at the bottom), replaced by this commit.  First,
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.  Second, xlog.c uses it to recycle
past WAL segments, where an overwrite should not happen (origin of the
change at f0e37a8) because there are protections about the WAL segment
to select when recycling an entry.  The third and last area is related
to the write of timeline history files.  writeTimeLineHistory() will
write a new timeline history file at the end of recovery on promotion,
so there should be no such files for the same timeline.
What remains is writeTimeLineHistoryFile(), that can be used in parallel
by a WAL receiver and the startup process, and some digging of the
buildfarm shows that EEXIST from a WAL receiver can happen with an error
of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\",
which would cause an automatic restart of the WAL receiver as it is
promoted to FATAL, hence this should improve the stability of the WAL
receiver as rename() would overwrite an existing TLI history file
already fetched by the startup process at recovery.

This is a bug fix, but knowing the unlikeliness of the problem involving
one or more crashes at an exceptionally bad moment, no backpatch is
done.  Also, I want to be careful with such changes (aaa3aed did the
opposite of this change by removing HAVE_WORKING_LINK so as Windows
would do a link() rather than a rename() but this was not
concurrent-safe).  A backpatch could be revisited in the future.  This
is the second time this change is attempted, ccfbd92 being the first
one, but this time no assertions are added for the case of a TLI history
file written concurrently by the WAL receiver or the startup process
because we can expect one to exist (some of the TAP tests are able to
trigger with a proper timing).

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
Discussion: https://postgr.es/m/Ym6GZbqQdlalSKSG@paquier.xyz
---
 src/backend/access/transam/timeline.c | 18 +++++-------------
 src/backend/access/transam/xlog.c     |  9 +++------
 2 files changed, 8 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index e6a29d9a9b..517ab023a3 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the completed history file into place with its final name.
+	 * Now move the completed history file into place with its final name,
+	 * replacing any existing file with the same name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ab4a510ea7..1fa15a58be 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3648,15 +3648,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	Assert(access(path, F_OK) != 0 && errno == ENOENT);
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.45.2

#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#34)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:

I've been double-checking the code to refresh myself with the problem,
and I don't see a reason to not apply something like the attached set
down to v13 for all these remaining branches (minus an edit of the
commit message).

LGTM

--
nathan

#36Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#35)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:

On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:

I've been double-checking the code to refresh myself with the problem,
and I don't see a reason to not apply something like the attached set
down to v13 for all these remaining branches (minus an edit of the
commit message).

LGTM

Dito.

#37Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#36)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:

On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:

LGTM

Dito.

Thanks for double-checking. Done this one.
--
Michael

#38Robert Pang
robertpang@google.com
In reply to: Michael Paquier (#37)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

Dear Michael,

Thank you for applying this back-patch. I also appreciate everyone's input
on this issue.

Sincerely,
Robert Pang

On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier <michael@paquier.xyz> wrote:

Show quoted text

On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:

On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:

LGTM

Dito.

Thanks for double-checking. Done this one.
--
Michael

#39Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#34)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:

On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote:

I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message. The crash has indeed to happen between the link() and
unlink() - but at the end of a checkpoint we do that operations hundreds of
times in a row on a busy server. And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quots.

Looks so, yep. Your timing and the report's timing are interesting.

I've been double-checking the code to refresh myself with the problem,
and I don't see a reason to not apply something like the attached set
down to v13 for all these remaining branches (minus an edit of the
commit message).

This became v14 commit 1f95181b44c843729caaa688f74babe9403b5850. I think it
is causing timing-dependent failures in archive recovery, at restartpoints.
v14 and earlier were relying on the "excl" part of durable_rename_excl() to
keep WAL preallocation and recycling from stomping on archive-restored WAL
files. If that's right, we need to either back-patch v15's suppression of
those features during archive recovery (commit cc2c7d6 and five others), or we
need to revert and fix the multiple hard links differently.

v15 commit cc2c7d6 "Skip WAL recycling and preallocation during archive
recovery" wrote that it fixed "a spurious durable_rename_excl() LOG message".
Replacing durable_rename_excl() with durable_rename() increased consequences
beyond spurious messages. I regret that I didn't make that connection during
post-commit review of commit 1f95181b44c843729caaa688f74babe9403b5850.
Trouble emerges during archive recovery, not during streaming or crash
recovery. The symptom is "invalid magic number 0000 in log segment X, offset
0" when PreallocXlogFiles() stomps. The symptom is "unexpected pageaddr X in
log segment Y, offset 0" [X < Y] when RemoveOldXlogFiles() recycling stomps.
wal_recycle=off probably avoids the latter but not the former.

Options I see:

1. Make v14 and v13 skip WAL recycling and preallocation during archive
recovery, like newer branches do. I think that means back-patching the six
commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.

2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
Avoid multiple hard links by making durable_rename_excl() first rename
oldfile to a temporary name. (I haven't thought this through in detail.
It may not suffice.)

I'm leaning toward (1) at least enough to see how messy the back-patch would
be, since I don't like risks of designing old-branch-specific solutions when
v15/v16/v17 have a proven solution. What else should we be thinking about
before deciding?

Arun Thirupathi collected the symptoms, traced them to commit
1f95181b44c843729caaa688f74babe9403b5850, and reported the diagnosis to me.
These symptoms have not emerged in v15+.

Thanks,
nm

#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#39)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:

Options I see:

1. Make v14 and v13 skip WAL recycling and preallocation during archive
recovery, like newer branches do. I think that means back-patching the six
commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.

2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
Avoid multiple hard links by making durable_rename_excl() first rename
oldfile to a temporary name. (I haven't thought this through in detail.
It may not suffice.)

I'm leaning toward (1) at least enough to see how messy the back-patch would
be, since I don't like risks of designing old-branch-specific solutions when
v15/v16/v17 have a proven solution. What else should we be thinking about
before deciding?

I agree with first attempting a back-patch. All of this stuff is from
2021-2022, so it's had time to bake and is IMHO lower risk.

--
nathan

#41Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#40)
14 attachment(s)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:

On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:

1. Make v14 and v13 skip WAL recycling and preallocation during archive
recovery, like newer branches do. I think that means back-patching the six
commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.

2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
Avoid multiple hard links by making durable_rename_excl() first rename
oldfile to a temporary name. (I haven't thought this through in detail.
It may not suffice.)

I'm leaning toward (1) at least enough to see how messy the back-patch would
be, since I don't like risks of designing old-branch-specific solutions when
v15/v16/v17 have a proven solution. What else should we be thinking about
before deciding?

I agree with first attempting a back-patch. All of this stuff is from
2021-2022, so it's had time to bake and is IMHO lower risk.

Sorry for the late reply here (life things and the timing of Noah's
report).

Discarding the option where we would design a new solution specific to
v14 and v13, I don't really see a different third option here.

Choice 2 means that we'd just leave v13 and v14 exposed to
corruptions, and that the message to our user base is that they should
upgrade to v15 to get that fixed, even if we still have roughly two
more 2 years of official community support, as they would still be
exposed to the risk of having multiple links pointing to a single WAL
segment, creating corruption. As reported upthread, that can be
reached.

Choice 1 would be the logical path, even if it may not be the
practical one in terms of backpatch risks vs benefits. It would be
much better than recommending people that they have to upgrade to 15
at least to make sure that their version of Postgres is safer to use.

Saying that, I've looked at the whole for a good part of the day. In
15~, one thing that may make a backpatch messier is related to
recovery and the fact that Robert has improved/refactored the area
quite a bit, tweaking the code so as we rely on slightly different
assumptions in 13/14 compared to 15~. There is a piece around
ThisTimeLineID, for example, which is something that
InstallXLogFileSegment() uses and also something that's touched by
1f95181b44c8. The impact of these pieces of refactorings specific to
v15 is surprisingly lower than I thought it would.

Some notes about the six commits you are mentioning, looking at them
one by one on the 13 and 14 stable branches.

First, for v14:
cc2c7d6~4: some bumps with the removal of use_lock with one flag.
Looks OK otherwise.
cc2c7d6~3: Looks OK. Based on what this addresses, this is low-risk.
cc2c7d6~2: XLogWalRcvWrite() has a conflict as an effect of
XLogWalRcvWrite() that can be given a "tli". It still seems correct
to me to use "recvFileTLI = ThisTimeLineID".
cc2c7d6~1: 8ad6c5dbbe5a needs to be reverted first. Commit
8ad6c5dbbe5a has been introduced as a workaround for what we would
backpatch here. One conflict in walreceiver.c.
cc2c7d6: Does not conflict, which was a bit surprising, TBH.
043_no_contrecord_switch.pl gets unhappy here, on an assertion failure
in WaitForWALToBecomeAvailable() about InstallXLogFileSegmentActive.
e36cbef: Previous failure fixed by this one, with the same symptoms
as reported in the original thread that led to this commit.

Then, for v13:
cc2c7d6~4: One conflict with InstallXLogFileSegment() in xlog.c,
nothing huge.
cc2c7d6~3: Looks OK.
cc2c7d6~2: Looks OK.
cc2c7d6~1: With 8ad6c5dbbe5a reverted, same as the v14 part.
cc2c7d6: Does not conflict. 043_no_contrecord_switch.pl gets unhappy
here, same as above.
e36cbef: Same as above.

Noah, all these improvements are something you have directly worked
on in 15~. I'm hoping not to have missed something, even if I've
looked at the paths manipulating the WAL segments changed here on v13
and v14. Could it be possible for you to double-check and also run
more tests if your enviroments help? Perhaps this needs specific
prerequisites for recovery? We need more confidence in the solution
here, even if it's proving to work for v15, we may still have gaps
specific to v14 and/or v13.

I am attaching a full set of patches for v14 and v13 that can be used
for these tests. WDYT?
--
Michael

Attachments:

v14-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patchtext/x-diff; charset=us-asciiDownload
From 27d2e964c585a2309bd340d856fe9e2402587b0e Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:55 -0700
Subject: [PATCH v14 1/7] Remove XLogFileInit() ability to skip
 ControlFileLock.

Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 46 ++++++++-------------------
 src/backend/replication/walreceiver.c |  2 +-
 3 files changed, 16 insertions(+), 34 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ee3e369b79f3..596965d353d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
+extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0334c2d3bba2..20449820c7cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,8 +927,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-								   bool find_free, XLogSegNo max_segno,
-								   bool use_lock);
+								   bool find_free, XLogSegNo max_segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -2520,7 +2519,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
 			ReserveExternalFD();
 		}
 
@@ -3293,10 +3292,6 @@ XLogNeedsFlush(XLogRecPtr record)
  * pre-existing file will be deleted).  On return, true if a pre-existing
  * file was used.
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -3305,7 +3300,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3465,8 +3460,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno,
-								use_lock))
+								*use_existent, max_segno))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3623,7 +3617,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3647,29 +3641,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * free slot is found between *segno and max_segno. (Ignored when find_free
  * is false.)
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns true if the file was installed successfully.  false indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool find_free, XLogSegNo max_segno)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
-	/*
-	 * We want to be sure that only one process does this at a time.
-	 */
-	if (use_lock)
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	if (!find_free)
 	{
@@ -3684,8 +3669,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 			if ((*segno) >= max_segno)
 			{
 				/* Failed to find a free slot within specified range */
-				if (use_lock)
-					LWLockRelease(ControlFileLock);
+				LWLockRelease(ControlFileLock);
 				return false;
 			}
 			(*segno)++;
@@ -3696,14 +3680,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	Assert(access(path, F_OK) != 0 && errno == ENOENT);
 	if (durable_rename(tmppath, path, LOG) != 0)
 	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
+		LWLockRelease(ControlFileLock);
 		/* durable_rename already emitted log message */
 		return false;
 	}
 
-	if (use_lock)
-		LWLockRelease(ControlFileLock);
+	LWLockRelease(ControlFileLock);
 
 	return true;
 }
@@ -3974,7 +3956,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
+		lf = XLogFileInit(_logSegNo, &use_existent);
 		close(lf);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
@@ -4251,7 +4233,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		*endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo))
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("recycled write-ahead log file \"%s\"",
@@ -5388,7 +5370,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5697,7 +5679,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		bool		use_existent = true;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4831a259c48a..4ad20f12b61b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -894,7 +894,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v14-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patchtext/x-diff; charset=us-asciiDownload
From 019002a9e1975c101b7c84f62d89cf48a136c851 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:55 -0700
Subject: [PATCH v14 2/7] In XLogFileInit(), fix *use_existent postcondition to
 suit callers.

Infrequently, the mismatch caused log_checkpoints messages and
TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
by one.  Since that consequence is so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/backend/access/transam/xlog.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20449820c7cb..c9176eea68c4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3289,8 +3289,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * logsegno: identify segment to be created/opened.
  *
  * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, true if a pre-existing
- * file was used.
+ * pre-existing file will be deleted).  On return, false iff this call added
+ * some segment on disk.
  *
  * Returns FD of opened file.
  *
@@ -3459,8 +3459,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno))
+	if (InstallXLogFileSegment(&installed_segno, tmppath,
+							   *use_existent, max_segno))
+		*use_existent = false;
+	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3470,9 +3472,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 		unlink(tmppath);
 	}
 
-	/* Set flag to tell caller there was no existent file */
-	*use_existent = false;
-
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
-- 
2.47.2

v14-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patchtext/x-diff; charset=us-asciiDownload
From 8f4517030288192c81e32ddf456028dade87104e Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v14 3/7] Remove XLogFileInit() ability to unlink a
 pre-existing file.

Only initdb used it.  initdb refuses to operate on a non-empty directory
and generally does not cope with pre-existing files of other kinds.
Hence, use the opportunity to simplify.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 61 +++++++++++----------------
 src/backend/replication/walreceiver.c |  4 +-
 3 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 596965d353d0..1860107f9a2e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
+extern int	XLogFileInit(XLogSegNo segno, bool *added);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9176eea68c4..d86295470908 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2452,7 +2452,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
+	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2518,8 +2518,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
+			openLogFile = XLogFileInit(openLogSegNo, &added);
 			ReserveExternalFD();
 		}
 
@@ -3288,9 +3287,7 @@ XLogNeedsFlush(XLogRecPtr record)
  *
  * logsegno: identify segment to be created/opened.
  *
- * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, false iff this call added
- * some segment on disk.
+ * *added: on return, true if this call raised the number of extant segments.
  *
  * Returns FD of opened file.
  *
@@ -3300,7 +3297,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent)
+XLogFileInit(XLogSegNo logsegno, bool *added)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3315,19 +3312,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
-	if (*use_existent)
+	*added = false;
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	if (fd < 0)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
-		{
-			if (errno != ENOENT)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open file \"%s\": %m", path)));
-		}
-		else
-			return fd;
+		if (errno != ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
 	}
+	else
+		return fd;
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that
@@ -3440,12 +3435,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the segment into place with its final name.
-	 *
-	 * If caller didn't want to use a pre-existing file, get rid of any
-	 * pre-existing file.  Otherwise, cope with possibility that someone else
-	 * has created the file while we were filling ours: if so, use ours to
-	 * pre-create a future log segment.
+	 * Now move the segment into place with its final name.  Cope with
+	 * possibility that someone else has created the file while we were
+	 * filling ours: if so, use ours to pre-create a future log segment.
 	 */
 	installed_segno = logsegno;
 
@@ -3459,9 +3451,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (InstallXLogFileSegment(&installed_segno, tmppath,
-							   *use_existent, max_segno))
-		*use_existent = false;
+	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+		*added = true;
 	else
 	{
 		/*
@@ -3946,7 +3937,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
 	int			lf;
-	bool		use_existent;
+	bool		added;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3954,10 +3945,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent);
+		lf = XLogFileInit(_logSegNo, &added);
 		close(lf);
-		if (!use_existent)
+		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
 }
@@ -5271,7 +5261,7 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		use_existent;
+	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5368,8 +5358,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent);
+	openLogFile = XLogFileInit(1, &added);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5675,10 +5664,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		use_existent = true;
+		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent);
+		fd = XLogFileInit(startLogSegNo, &added);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4ad20f12b61b..b6f07ab81f1d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -890,11 +890,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0)
 		{
-			bool		use_existent = true;
+			bool		added = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &use_existent);
+			recvFile = XLogFileInit(recvSegNo, &added);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v14-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patchtext/x-diff; charset=us-asciiDownload
From 3086175a74b701e1ae1ddf6423cf796b6e2bde4c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 10 Mar 2025 13:18:52 +0900
Subject: [PATCH v14 4/7] Revert "Add HINT for restartpoint race with
 KeepFileRestoredFromArchive()."

This reverts commit 8ad6c5dbbe5a234c55c6663020db297251756006.
---
 src/backend/access/transam/xlog.c |  5 +----
 src/backend/storage/file/fd.c     | 10 ++--------
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d86295470908..793451bdf70c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3468,10 +3468,7 @@ XLogFileInit(XLogSegNo logsegno, bool *added)
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+				 errmsg("could not open file \"%s\": %m", path)));
 
 	elog(DEBUG2, "done creating and filling new WAL file");
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 09c4a93e40b0..484a07302a40 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -834,10 +834,7 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+						oldfile, newfile)));
 		return -1;
 	}
 	unlink(oldfile);
@@ -847,10 +844,7 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+						oldfile, newfile)));
 		return -1;
 	}
 #endif
-- 
2.47.2

v14-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patchtext/x-diff; charset=us-asciiDownload
From 712671efd7b77fada6dad308dbaaba28cab22064 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v14 5/7] Don't ERROR on PreallocXlogFiles() race condition.

Before a restartpoint finishes PreallocXlogFiles(), a startup process
KeepFileRestoredFromArchive() call can unlink the preallocated segment.
If a CHECKPOINT sql command had elicited the restartpoint experiencing
the race condition, that sql command failed.  Moreover, the restartpoint
omitted its log_checkpoints message and some inessential resource
reclamation.  Prevent the ERROR by skipping open() of the segment.
Since these consequences are so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 79 +++++++++++++++++++--------
 src/backend/replication/walreceiver.c |  4 +-
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1860107f9a2e..a0706c06a791 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -296,7 +296,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *added);
+extern int	XLogFileInit(XLogSegNo segno);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 793451bdf70c..73a828f075ad 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2452,7 +2452,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2518,7 +2517,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			openLogFile = XLogFileInit(openLogSegNo, &added);
+			openLogFile = XLogFileInit(openLogSegNo);
 			ReserveExternalFD();
 		}
 
@@ -3283,23 +3282,21 @@ XLogNeedsFlush(XLogRecPtr record)
 }
 
 /*
- * Create a new XLOG file segment, or open a pre-existing one.
+ * Try to make a given XLOG file segment exist.
  *
- * logsegno: identify segment to be created/opened.
+ * logsegno: identify segment.
  *
  * *added: on return, true if this call raised the number of extant segments.
  *
- * Returns FD of opened file.
+ * path: on return, this char[MAXPGPATH] has the path to the logsegno file.
  *
- * Note: errors here are ERROR not PANIC because we might or might not be
- * inside a critical section (eg, during checkpoint there is no reason to
- * take down the system on failure).  They will promote to PANIC if we are
- * in a critical section.
+ * Returns -1 or FD of opened file.  A -1 here is not an error; a caller
+ * wanting an open segment should attempt to open "path", which usually will
+ * succeed.  (This is weird, but it's efficient for the callers.)
  */
-int
-XLogFileInit(XLogSegNo logsegno, bool *added)
+static int
+XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path)
 {
-	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
@@ -3452,26 +3449,53 @@ XLogFileInit(XLogSegNo logsegno, bool *added)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+	{
 		*added = true;
+		elog(DEBUG2, "done creating and filling new WAL file");
+	}
 	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
-		 * failed to rename the file into place. If the rename failed, opening
-		 * the file below will fail.
+		 * failed to rename the file into place. If the rename failed, a
+		 * caller opening the file may fail.
 		 */
 		unlink(tmppath);
+		elog(DEBUG2, "abandoned new WAL file");
 	}
 
+	return -1;
+}
+
+/*
+ * Create a new XLOG file segment, or open a pre-existing one.
+ *
+ * logsegno: identify segment to be created/opened.
+ *
+ * Returns FD of opened file.
+ *
+ * Note: errors here are ERROR not PANIC because we might or might not be
+ * inside a critical section (eg, during checkpoint there is no reason to
+ * take down the system on failure).  They will promote to PANIC if we are
+ * in a critical section.
+ */
+int
+XLogFileInit(XLogSegNo logsegno)
+{
+	bool		ignore_added;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	fd = XLogFileInitInternal(logsegno, &ignore_added, path);
+	if (fd >= 0)
+		return fd;
+
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
-
-	elog(DEBUG2, "done creating and filling new WAL file");
-
 	return fd;
 }
 
@@ -3928,6 +3952,15 @@ XLogFileClose(void)
  * High-volume systems will be OK once they've built up a sufficient set of
  * recycled log segments, but the startup transient is likely to include
  * a lot of segment creations by foreground processes, which is not so good.
+ *
+ * XLogFileInitInternal() can ereport(ERROR).  All known causes indicate big
+ * trouble; for example, a full filesystem is one cause.  The checkpoint WAL
+ * and/or ControlFile updates already completed.  If a RequestCheckpoint()
+ * initiated the present checkpoint and an ERROR ends this function, the
+ * command that called RequestCheckpoint() fails.  That's not ideal, but it's
+ * not worth contorting more functions to use caller-specified elevel values.
+ * (With or without RequestCheckpoint(), an ERROR forestalls some inessential
+ * reporting and resource reclamation.)
  */
 static void
 PreallocXlogFiles(XLogRecPtr endptr)
@@ -3935,6 +3968,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		added;
+	char		path[MAXPGPATH];
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3942,8 +3976,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		lf = XLogFileInit(_logSegNo, &added);
-		close(lf);
+		lf = XLogFileInitInternal(_logSegNo, &added, path);
+		if (lf >= 0)
+			close(lf);
 		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -5258,7 +5293,6 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5355,7 +5389,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	openLogFile = XLogFileInit(1, &added);
+	openLogFile = XLogFileInit(1);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5661,10 +5695,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &added);
+		fd = XLogFileInit(startLogSegNo);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b6f07ab81f1d..7ae04b9682c7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -890,11 +890,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0)
 		{
-			bool		added = true;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &added);
+			recvFile = XLogFileInit(recvSegNo);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v14-0006-Skip-WAL-recycling-and-preallocation-during-arch.patchtext/x-diff; charset=us-asciiDownload
From 8bf97022fdd53a019c709efb757a5f1d953a1474 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v14 6/7] Skip WAL recycling and preallocation during archive
 recovery.

The previous commit addressed the chief consequences of a race condition
between InstallXLogFileSegment() and KeepFileRestoredFromArchive().  Fix
three lesser consequences.  A spurious durable_rename_excl() LOG message
remained possible.  KeepFileRestoredFromArchive() wasted the proceeds of
WAL recycling and preallocation.  Finally, XLogFileInitInternal() could
return a descriptor for a file that KeepFileRestoredFromArchive() had
already unlinked.  That felt like a recipe for future bugs.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/backend/access/transam/xlog.c | 65 +++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 73a828f075ad..c5fe7e6fa4cc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -673,6 +673,16 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedHotStandbyActive;
 
+	/*
+	 * InstallXLogFileSegmentActive indicates whether the checkpointer should
+	 * arrange for future segments by recycling and/or PreallocXlogFiles().
+	 * Protected by ControlFileLock.  Only the startup process changes it.  If
+	 * true, anyone can use InstallXLogFileSegment().  If false, the startup
+	 * process owns the exclusive right to install segments, by reading from
+	 * the archive and possibly replacing existing files.
+	 */
+	bool		InstallXLogFileSegmentActive;
+
 	/*
 	 * SharedPromoteIsTriggered indicates if a standby promotion has been
 	 * triggered.  Protected by info_lck.
@@ -935,6 +945,7 @@ static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -3653,8 +3664,8 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * is false.)
  *
  * Returns true if the file was installed successfully.  false indicates that
- * max_segno limit was exceeded, or an error occurred while renaming the
- * file into place.
+ * max_segno limit was exceeded, the startup process has disabled this
+ * function for now, or an error occurred while renaming the file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
@@ -3666,6 +3677,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+	{
+		LWLockRelease(ControlFileLock);
+		return false;
+	}
 
 	if (!find_free)
 	{
@@ -3770,6 +3786,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 	 */
 	if (source == XLOG_FROM_ARCHIVE)
 	{
+		Assert(!XLogCtl->InstallXLogFileSegmentActive);
 		KeepFileRestoredFromArchive(path, xlogfname);
 
 		/*
@@ -3971,6 +3988,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	char		path[MAXPGPATH];
 	uint64		offset;
 
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+		return;					/* unlocked check says no */
+
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
 	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
 	if (offset >= (uint32) (0.75 * wal_segment_size))
@@ -4252,6 +4272,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 	 */
 	if (wal_recycle &&
 		*endlogSegNo <= recycleSegNo &&
+		XLogCtl->InstallXLogFileSegmentActive &&	/* callee rechecks this */
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
 							   true, recycleSegNo))
@@ -4265,7 +4286,7 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 	}
 	else
 	{
-		/* No need for any more future segments... */
+		/* No need for any more future segments, or recycling failed ... */
 		int			rc;
 
 		ereport(DEBUG2,
@@ -5270,6 +5291,7 @@ XLOGShmemInit(void)
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
+	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -5297,6 +5319,11 @@ BootStrapXLOG(void)
 	struct timeval tv;
 	pg_crc32c	crc;
 
+	/* allow ordinary WAL segment creation, like StartupXLOG() would */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
 	/*
 	 * Select a hopefully-unique system identifier code for this installation.
 	 * We use the result of gettimeofday(), including the fractional seconds
@@ -7685,7 +7712,7 @@ StartupXLOG(void)
 	 * over these records and subsequent ones if it's still alive when we
 	 * start writing WAL.
 	 */
-	ShutdownWalRcv();
+	XLogShutdownWalRcv();
 
 	/*
 	 * Reset unlogged relations to the contents of their INIT fork. This is
@@ -7710,7 +7737,7 @@ StartupXLOG(void)
 	 * recovery, e.g., timeline history file) from archive or pg_wal.
 	 *
 	 * Note that standby mode must be turned off after killing WAL receiver,
-	 * i.e., calling ShutdownWalRcv().
+	 * i.e., calling XLogShutdownWalRcv().
 	 */
 	Assert(!WalRcvStreaming());
 	StandbyMode = false;
@@ -7779,6 +7806,14 @@ StartupXLOG(void)
 	 */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
+	/*
+	 * Allow ordinary WAL segment creation before any exitArchiveRecovery(),
+	 * which sometimes creates a segment, and after the last ReadRecord().
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -12717,7 +12752,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 						return false;
 					}
 
@@ -12765,7 +12800,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * WAL that we restore from archive.
 					 */
 					if (WalRcvStreaming())
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
@@ -12895,7 +12930,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (pendingWalRcvRestart && !startWalReceiver)
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 						/*
 						 * Re-scan for possible new timelines if we were
@@ -12945,6 +12980,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
+						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+						XLogCtl->InstallXLogFileSegmentActive = true;
+						LWLockRelease(ControlFileLock);
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
@@ -13115,6 +13153,17 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
+/* Thin wrapper around ShutdownWalRcv(). */
+static void
+XLogShutdownWalRcv(void)
+{
+	ShutdownWalRcv();
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = false;
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().
-- 
2.47.2

v14-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patchtext/x-diff; charset=us-asciiDownload
From c2d49a28a5d1a0e48c7a9edf97c481d984af98e9 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Thu, 15 Sep 2022 06:45:23 -0700
Subject: [PATCH v14 7/7] Reset InstallXLogFileSegmentActive after walreceiver
 self-initiated exit.

After commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16 added this flag,
failure to reset it caused assertion failures.  In non-assert builds, it
made the system fail to achieve the objectives listed in that commit;
chiefly, we might emit a spurious log message.  Back-patch to v15, where
that commit first appeared.

Bharath Rupireddy and Kyotaro Horiguchi.  Reviewed by Dilip Kumar,
Nathan Bossart and Michael Paquier.  Reported by Dilip Kumar.

Discussion: https://postgr.es/m/CAFiTN-sE3ry=ycMPVtC+Djw4Fd7gbUGVv_qqw6qfzp=JLvqT3g@mail.gmail.com
---
 src/backend/access/transam/xlog.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c5fe7e6fa4cc..9ce4297e3de6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12799,8 +12799,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * walreceiver is not active, so that it won't overwrite
 					 * WAL that we restore from archive.
 					 */
-					if (WalRcvStreaming())
-						XLogShutdownWalRcv();
+					XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
-- 
2.47.2

v13-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patchtext/x-diff; charset=us-asciiDownload
From d5de431953286182b8b0bfea4c0ecf23c1d31fbc Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:55 -0700
Subject: [PATCH v13 1/7] Remove XLogFileInit() ability to skip
 ControlFileLock.

Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 46 ++++++++-------------------
 src/backend/replication/walreceiver.c |  2 +-
 3 files changed, 16 insertions(+), 34 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 01b3b0a767dc..7cb147a7ee37 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -286,7 +286,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
+extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 769f4a202f28..859f86cfc3cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -917,8 +917,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-								   bool find_free, XLogSegNo max_segno,
-								   bool use_lock);
+								   bool find_free, XLogSegNo max_segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -2509,7 +2508,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
 			ReserveExternalFD();
 		}
 
@@ -3263,10 +3262,6 @@ XLogNeedsFlush(XLogRecPtr record)
  * pre-existing file will be deleted).  On return, true if a pre-existing
  * file was used.
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -3275,7 +3270,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3420,8 +3415,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno,
-								use_lock))
+								*use_existent, max_segno))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3578,7 +3572,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3602,29 +3596,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * free slot is found between *segno and max_segno. (Ignored when find_free
  * is false.)
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns true if the file was installed successfully.  false indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool find_free, XLogSegNo max_segno)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
-	/*
-	 * We want to be sure that only one process does this at a time.
-	 */
-	if (use_lock)
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	if (!find_free)
 	{
@@ -3639,8 +3624,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 			if ((*segno) >= max_segno)
 			{
 				/* Failed to find a free slot within specified range */
-				if (use_lock)
-					LWLockRelease(ControlFileLock);
+				LWLockRelease(ControlFileLock);
 				return false;
 			}
 			(*segno)++;
@@ -3651,14 +3635,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	Assert(access(path, F_OK) != 0 && errno == ENOENT);
 	if (durable_rename(tmppath, path, LOG) != 0)
 	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
+		LWLockRelease(ControlFileLock);
 		/* durable_rename already emitted log message */
 		return false;
 	}
 
-	if (use_lock)
-		LWLockRelease(ControlFileLock);
+	LWLockRelease(ControlFileLock);
 
 	return true;
 }
@@ -3929,7 +3911,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
+		lf = XLogFileInit(_logSegNo, &use_existent);
 		close(lf);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
@@ -4206,7 +4188,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled write-ahead log file \"%s\"",
@@ -5342,7 +5324,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5651,7 +5633,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		bool		use_existent = true;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 479c6c441bda..b874cf364f06 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -922,7 +922,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v13-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patchtext/x-diff; charset=us-asciiDownload
From e805d4e8b2bcfb7d72552ce914e480606c978bef Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:55 -0700
Subject: [PATCH v13 2/7] In XLogFileInit(), fix *use_existent postcondition to
 suit callers.

Infrequently, the mismatch caused log_checkpoints messages and
TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
by one.  Since that consequence is so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/backend/access/transam/xlog.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 859f86cfc3cd..28fd6da7d366 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3259,8 +3259,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * logsegno: identify segment to be created/opened.
  *
  * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, true if a pre-existing
- * file was used.
+ * pre-existing file will be deleted).  On return, false iff this call added
+ * some segment on disk.
  *
  * Returns FD of opened file.
  *
@@ -3414,8 +3414,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno))
+	if (InstallXLogFileSegment(&installed_segno, tmppath,
+							   *use_existent, max_segno))
+		*use_existent = false;
+	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3425,9 +3427,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 		unlink(tmppath);
 	}
 
-	/* Set flag to tell caller there was no existent file */
-	*use_existent = false;
-
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
-- 
2.47.2

v13-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patchtext/x-diff; charset=us-asciiDownload
From 3944585703be84b60e7ff1c45c2578ee312aa736 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v13 3/7] Remove XLogFileInit() ability to unlink a
 pre-existing file.

Only initdb used it.  initdb refuses to operate on a non-empty directory
and generally does not cope with pre-existing files of other kinds.
Hence, use the opportunity to simplify.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 61 +++++++++++----------------
 src/backend/replication/walreceiver.c |  4 +-
 3 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 7cb147a7ee37..6119452ba8cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -286,7 +286,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent);
+extern int	XLogFileInit(XLogSegNo segno, bool *added);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 28fd6da7d366..6fbb5328fd0b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2440,7 +2440,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
+	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2507,8 +2507,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
+			openLogFile = XLogFileInit(openLogSegNo, &added);
 			ReserveExternalFD();
 		}
 
@@ -3258,9 +3257,7 @@ XLogNeedsFlush(XLogRecPtr record)
  *
  * logsegno: identify segment to be created/opened.
  *
- * *use_existent: if true, OK to use a pre-existing file (else, any
- * pre-existing file will be deleted).  On return, false iff this call added
- * some segment on disk.
+ * *added: on return, true if this call raised the number of extant segments.
  *
  * Returns FD of opened file.
  *
@@ -3270,7 +3267,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent)
+XLogFileInit(XLogSegNo logsegno, bool *added)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3286,19 +3283,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
-	if (*use_existent)
+	*added = false;
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	if (fd < 0)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
-		{
-			if (errno != ENOENT)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open file \"%s\": %m", path)));
-		}
-		else
-			return fd;
+		if (errno != ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
 	}
+	else
+		return fd;
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that
@@ -3395,12 +3390,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
 	/*
-	 * Now move the segment into place with its final name.
-	 *
-	 * If caller didn't want to use a pre-existing file, get rid of any
-	 * pre-existing file.  Otherwise, cope with possibility that someone else
-	 * has created the file while we were filling ours: if so, use ours to
-	 * pre-create a future log segment.
+	 * Now move the segment into place with its final name.  Cope with
+	 * possibility that someone else has created the file while we were
+	 * filling ours: if so, use ours to pre-create a future log segment.
 	 */
 	installed_segno = logsegno;
 
@@ -3414,9 +3406,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 	 * CheckPointSegments.
 	 */
 	max_segno = logsegno + CheckPointSegments;
-	if (InstallXLogFileSegment(&installed_segno, tmppath,
-							   *use_existent, max_segno))
-		*use_existent = false;
+	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+		*added = true;
 	else
 	{
 		/*
@@ -3901,7 +3892,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
 	int			lf;
-	bool		use_existent;
+	bool		added;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3909,10 +3900,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent);
+		lf = XLogFileInit(_logSegNo, &added);
 		close(lf);
-		if (!use_existent)
+		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
 }
@@ -5225,7 +5215,7 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		use_existent;
+	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5322,8 +5312,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent);
+	openLogFile = XLogFileInit(1, &added);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5629,10 +5618,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		use_existent = true;
+		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent);
+		fd = XLogFileInit(startLogSegNo, &added);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b874cf364f06..976dbc04cde7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -918,11 +918,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0)
 		{
-			bool		use_existent = true;
+			bool		added = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &use_existent);
+			recvFile = XLogFileInit(recvSegNo, &added);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v13-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patchtext/x-diff; charset=us-asciiDownload
From 994fd5e49f76e40b9eae6d02d0a463d5048e9302 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 10 Mar 2025 13:18:52 +0900
Subject: [PATCH v13 4/7] Revert "Add HINT for restartpoint race with
 KeepFileRestoredFromArchive()."

This reverts commit 8ad6c5dbbe5a234c55c6663020db297251756006.
---
 src/backend/access/transam/xlog.c |  5 +----
 src/backend/storage/file/fd.c     | 10 ++--------
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6fbb5328fd0b..4ef417f1eddb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3423,10 +3423,7 @@ XLogFileInit(XLogSegNo logsegno, bool *added)
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+				 errmsg("could not open file \"%s\": %m", path)));
 
 	elog(DEBUG2, "done creating and filling new WAL file");
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 452ab88cafad..7aa50a5d2e57 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -800,10 +800,7 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						oldfile, newfile),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+						oldfile, newfile)));
 		return -1;
 	}
 	unlink(oldfile);
@@ -813,10 +810,7 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						oldfile, newfile),
-				 (AmCheckpointerProcess() ?
-				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
-				  0)));
+						oldfile, newfile)));
 		return -1;
 	}
 #endif
-- 
2.47.2

v13-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patchtext/x-diff; charset=us-asciiDownload
From 1ab26db206a00ce2a4b24996b7b13ae841662dfc Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v13 5/7] Don't ERROR on PreallocXlogFiles() race condition.

Before a restartpoint finishes PreallocXlogFiles(), a startup process
KeepFileRestoredFromArchive() call can unlink the preallocated segment.
If a CHECKPOINT sql command had elicited the restartpoint experiencing
the race condition, that sql command failed.  Moreover, the restartpoint
omitted its log_checkpoints message and some inessential resource
reclamation.  Prevent the ERROR by skipping open() of the segment.
Since these consequences are so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/include/access/xlog.h             |  2 +-
 src/backend/access/transam/xlog.c     | 79 +++++++++++++++++++--------
 src/backend/replication/walreceiver.c |  4 +-
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6119452ba8cd..7526378d57f0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -286,7 +286,7 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *added);
+extern int	XLogFileInit(XLogSegNo segno);
 extern int	XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ef417f1eddb..bbc9dd87b40a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2440,7 +2440,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	bool		ispartialpage;
 	bool		last_iteration;
 	bool		finishing_seg;
-	bool		added;
 	int			curridx;
 	int			npages;
 	int			startidx;
@@ -2507,7 +2506,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 							wal_segment_size);
 
 			/* create/use new log file */
-			openLogFile = XLogFileInit(openLogSegNo, &added);
+			openLogFile = XLogFileInit(openLogSegNo);
 			ReserveExternalFD();
 		}
 
@@ -3253,23 +3252,21 @@ XLogNeedsFlush(XLogRecPtr record)
 }
 
 /*
- * Create a new XLOG file segment, or open a pre-existing one.
+ * Try to make a given XLOG file segment exist.
  *
- * logsegno: identify segment to be created/opened.
+ * logsegno: identify segment.
  *
  * *added: on return, true if this call raised the number of extant segments.
  *
- * Returns FD of opened file.
+ * path: on return, this char[MAXPGPATH] has the path to the logsegno file.
  *
- * Note: errors here are ERROR not PANIC because we might or might not be
- * inside a critical section (eg, during checkpoint there is no reason to
- * take down the system on failure).  They will promote to PANIC if we are
- * in a critical section.
+ * Returns -1 or FD of opened file.  A -1 here is not an error; a caller
+ * wanting an open segment should attempt to open "path", which usually will
+ * succeed.  (This is weird, but it's efficient for the callers.)
  */
-int
-XLogFileInit(XLogSegNo logsegno, bool *added)
+static int
+XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path)
 {
-	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
@@ -3407,26 +3404,53 @@ XLogFileInit(XLogSegNo logsegno, bool *added)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (InstallXLogFileSegment(&installed_segno, tmppath, true, max_segno))
+	{
 		*added = true;
+		elog(DEBUG2, "done creating and filling new WAL file");
+	}
 	else
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
-		 * failed to rename the file into place. If the rename failed, opening
-		 * the file below will fail.
+		 * failed to rename the file into place. If the rename failed, a
+		 * caller opening the file may fail.
 		 */
 		unlink(tmppath);
+		elog(DEBUG2, "abandoned new WAL file");
 	}
 
+	return -1;
+}
+
+/*
+ * Create a new XLOG file segment, or open a pre-existing one.
+ *
+ * logsegno: identify segment to be created/opened.
+ *
+ * Returns FD of opened file.
+ *
+ * Note: errors here are ERROR not PANIC because we might or might not be
+ * inside a critical section (eg, during checkpoint there is no reason to
+ * take down the system on failure).  They will promote to PANIC if we are
+ * in a critical section.
+ */
+int
+XLogFileInit(XLogSegNo logsegno)
+{
+	bool		ignore_added;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	fd = XLogFileInitInternal(logsegno, &ignore_added, path);
+	if (fd >= 0)
+		return fd;
+
 	/* Now open original target segment (might not be file I just made) */
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
-
-	elog(DEBUG2, "done creating and filling new WAL file");
-
 	return fd;
 }
 
@@ -3883,6 +3907,15 @@ XLogFileClose(void)
  * High-volume systems will be OK once they've built up a sufficient set of
  * recycled log segments, but the startup transient is likely to include
  * a lot of segment creations by foreground processes, which is not so good.
+ *
+ * XLogFileInitInternal() can ereport(ERROR).  All known causes indicate big
+ * trouble; for example, a full filesystem is one cause.  The checkpoint WAL
+ * and/or ControlFile updates already completed.  If a RequestCheckpoint()
+ * initiated the present checkpoint and an ERROR ends this function, the
+ * command that called RequestCheckpoint() fails.  That's not ideal, but it's
+ * not worth contorting more functions to use caller-specified elevel values.
+ * (With or without RequestCheckpoint(), an ERROR forestalls some inessential
+ * reporting and resource reclamation.)
  */
 static void
 PreallocXlogFiles(XLogRecPtr endptr)
@@ -3890,6 +3923,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		added;
+	char		path[MAXPGPATH];
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3897,8 +3931,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	if (offset >= (uint32) (0.75 * wal_segment_size))
 	{
 		_logSegNo++;
-		lf = XLogFileInit(_logSegNo, &added);
-		close(lf);
+		lf = XLogFileInitInternal(_logSegNo, &added, path);
+		if (lf >= 0)
+			close(lf);
 		if (added)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -5212,7 +5247,6 @@ BootStrapXLOG(void)
 	XLogLongPageHeader longpage;
 	XLogRecord *record;
 	char	   *recptr;
-	bool		added;
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
@@ -5309,7 +5343,7 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	openLogFile = XLogFileInit(1, &added);
+	openLogFile = XLogFileInit(1);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5615,10 +5649,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * The switch happened at a segment boundary, so just create the next
 		 * segment on the new timeline.
 		 */
-		bool		added;
 		int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &added);
+		fd = XLogFileInit(startLogSegNo);
 
 		if (close(fd) != 0)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 976dbc04cde7..e1d7c0f75435 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -918,11 +918,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		if (recvFile < 0)
 		{
-			bool		added = true;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			recvFile = XLogFileInit(recvSegNo, &added);
+			recvFile = XLogFileInit(recvSegNo);
 			recvFileTLI = ThisTimeLineID;
 		}
 
-- 
2.47.2

v13-0006-Skip-WAL-recycling-and-preallocation-during-arch.patchtext/x-diff; charset=us-asciiDownload
From 74f9824b0a2db22f0fabaa6059d6a8d33dd4e72b Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Mon, 28 Jun 2021 18:34:56 -0700
Subject: [PATCH v13 6/7] Skip WAL recycling and preallocation during archive
 recovery.

The previous commit addressed the chief consequences of a race condition
between InstallXLogFileSegment() and KeepFileRestoredFromArchive().  Fix
three lesser consequences.  A spurious durable_rename_excl() LOG message
remained possible.  KeepFileRestoredFromArchive() wasted the proceeds of
WAL recycling and preallocation.  Finally, XLogFileInitInternal() could
return a descriptor for a file that KeepFileRestoredFromArchive() had
already unlinked.  That felt like a recipe for future bugs.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
---
 src/backend/access/transam/xlog.c | 65 +++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bbc9dd87b40a..d585af79a6ec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -675,6 +675,16 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedHotStandbyActive;
 
+	/*
+	 * InstallXLogFileSegmentActive indicates whether the checkpointer should
+	 * arrange for future segments by recycling and/or PreallocXlogFiles().
+	 * Protected by ControlFileLock.  Only the startup process changes it.  If
+	 * true, anyone can use InstallXLogFileSegment().  If false, the startup
+	 * process owns the exclusive right to install segments, by reading from
+	 * the archive and possibly replacing existing files.
+	 */
+	bool		InstallXLogFileSegmentActive;
+
 	/*
 	 * SharedPromoteIsTriggered indicates if a standby promotion has been
 	 * triggered.  Protected by info_lck.
@@ -925,6 +935,7 @@ static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -3608,8 +3619,8 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * is false.)
  *
  * Returns true if the file was installed successfully.  false indicates that
- * max_segno limit was exceeded, or an error occurred while renaming the
- * file into place.
+ * max_segno limit was exceeded, the startup process has disabled this
+ * function for now, or an error occurred while renaming the file into place.
  */
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
@@ -3621,6 +3632,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+	{
+		LWLockRelease(ControlFileLock);
+		return false;
+	}
 
 	if (!find_free)
 	{
@@ -3725,6 +3741,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 	 */
 	if (source == XLOG_FROM_ARCHIVE)
 	{
+		Assert(!XLogCtl->InstallXLogFileSegmentActive);
 		KeepFileRestoredFromArchive(path, xlogfname);
 
 		/*
@@ -3926,6 +3943,9 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	char		path[MAXPGPATH];
 	uint64		offset;
 
+	if (!XLogCtl->InstallXLogFileSegmentActive)
+		return;					/* unlocked check says no */
+
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
 	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
 	if (offset >= (uint32) (0.75 * wal_segment_size))
@@ -4207,6 +4227,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	 */
 	if (wal_recycle &&
 		endlogSegNo <= recycleSegNo &&
+		XLogCtl->InstallXLogFileSegmentActive &&	/* callee rechecks this */
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
 							   true, recycleSegNo))
@@ -4220,7 +4241,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	}
 	else
 	{
-		/* No need for any more future segments... */
+		/* No need for any more future segments, or recycling failed ... */
 		int			rc;
 
 		ereport(DEBUG2,
@@ -5225,6 +5246,7 @@ XLOGShmemInit(void)
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
+	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -5251,6 +5273,11 @@ BootStrapXLOG(void)
 	struct timeval tv;
 	pg_crc32c	crc;
 
+	/* allow ordinary WAL segment creation, like StartupXLOG() would */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
 	/*
 	 * Select a hopefully-unique system identifier code for this installation.
 	 * We use the result of gettimeofday(), including the fractional seconds
@@ -7531,7 +7558,7 @@ StartupXLOG(void)
 	 * over these records and subsequent ones if it's still alive when we
 	 * start writing WAL.
 	 */
-	ShutdownWalRcv();
+	XLogShutdownWalRcv();
 
 	/*
 	 * Reset unlogged relations to the contents of their INIT fork. This is
@@ -7556,7 +7583,7 @@ StartupXLOG(void)
 	 * recovery, e.g., timeline history file) from archive or pg_wal.
 	 *
 	 * Note that standby mode must be turned off after killing WAL receiver,
-	 * i.e., calling ShutdownWalRcv().
+	 * i.e., calling XLogShutdownWalRcv().
 	 */
 	Assert(!WalRcvStreaming());
 	StandbyMode = false;
@@ -7625,6 +7652,14 @@ StartupXLOG(void)
 	 */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
+	/*
+	 * Allow ordinary WAL segment creation before any exitArchiveRecovery(),
+	 * which sometimes creates a segment, and after the last ReadRecord().
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = true;
+	LWLockRelease(ControlFileLock);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -12491,7 +12526,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 						return false;
 					}
 
@@ -12539,7 +12574,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * WAL that we restore from archive.
 					 */
 					if (WalRcvStreaming())
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
@@ -12669,7 +12704,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (pendingWalRcvRestart && !startWalReceiver)
 					{
-						ShutdownWalRcv();
+						XLogShutdownWalRcv();
 
 						/*
 						 * Re-scan for possible new timelines if we were
@@ -12720,6 +12755,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
+						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+						XLogCtl->InstallXLogFileSegmentActive = true;
+						LWLockRelease(ControlFileLock);
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
@@ -12882,6 +12920,17 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
+/* Thin wrapper around ShutdownWalRcv(). */
+static void
+XLogShutdownWalRcv(void)
+{
+	ShutdownWalRcv();
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	XLogCtl->InstallXLogFileSegmentActive = false;
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().
-- 
2.47.2

v13-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patchtext/x-diff; charset=us-asciiDownload
From e1eedb586974de10baf149c128987ad2dde447bc Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Thu, 15 Sep 2022 06:45:23 -0700
Subject: [PATCH v13 7/7] Reset InstallXLogFileSegmentActive after walreceiver
 self-initiated exit.

After commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16 added this flag,
failure to reset it caused assertion failures.  In non-assert builds, it
made the system fail to achieve the objectives listed in that commit;
chiefly, we might emit a spurious log message.  Back-patch to v15, where
that commit first appeared.

Bharath Rupireddy and Kyotaro Horiguchi.  Reviewed by Dilip Kumar,
Nathan Bossart and Michael Paquier.  Reported by Dilip Kumar.

Discussion: https://postgr.es/m/CAFiTN-sE3ry=ycMPVtC+Djw4Fd7gbUGVv_qqw6qfzp=JLvqT3g@mail.gmail.com
---
 src/backend/access/transam/xlog.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d585af79a6ec..fc104a377ad9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12573,8 +12573,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * walreceiver is not active, so that it won't overwrite
 					 * WAL that we restore from archive.
 					 */
-					if (WalRcvStreaming())
-						XLogShutdownWalRcv();
+					XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
-- 
2.47.2

#42Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#41)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:

I am attaching a full set of patches for v14 and v13 that can be used
for these tests. WDYT?

And I forgot to mention that I've checked both v14 and v13, where I
have noticed the two patterns "invalid magic number 0000 in log
segment X, offset 0" and "unexpected pageaddr X in log segment Y" as
LOG and DEBUG entries.

These are gone once the patches are applied on both branches. Just
set max_wal_size and min_wal_size (128MB here) low enough with a
standby doing archive recovery, and bulk-load WAL on the primary to
accelerate the reproduction of the problem. It takes a couple of
minutes to have these show up with the unpatched branches.
--
Michael

#43Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#41)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:

On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:

On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:

1. Make v14 and v13 skip WAL recycling and preallocation during archive
recovery, like newer branches do. I think that means back-patching the six
commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.

2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
Avoid multiple hard links by making durable_rename_excl() first rename
oldfile to a temporary name. (I haven't thought this through in detail.
It may not suffice.)

I'm leaning toward (1) at least enough to see how messy the back-patch would
be, since I don't like risks of designing old-branch-specific solutions when
v15/v16/v17 have a proven solution. What else should we be thinking about
before deciding?

I agree with first attempting a back-patch. All of this stuff is from
2021-2022, so it's had time to bake and is IMHO lower risk.

Could it be possible for you to double-check and also run
more tests if your enviroments help?

Thanks for crafting back-branch versions. I've queued a task to confirm I get
the same result. There's a test case I'll polish, too.

#44Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#43)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:

Thanks for crafting back-branch versions. I've queued a task to confirm I get
the same result.

Thanks for that. That helps a lot.

There's a test case I'll polish, too.

Are you considering the addition of a TAP test in 17~ based on a wait
injection point in the checkpointer coupled with a check of the server
logs to see if we see the error patterns you've spotted?
--
Michael

#45Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#44)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:

On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:

Thanks for crafting back-branch versions. I've queued a task to confirm I get
the same result.

Thanks for that. That helps a lot.

I'll let you know when I get there.

There's a test case I'll polish, too.

Are you considering the addition of a TAP test in 17~ based on a wait
injection point in the checkpointer coupled with a check of the server
logs to see if we see the error patterns you've spotted?

No, nothing involving injection points or otherwise version-specific.

#46Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#45)
1 attachment(s)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Tue, Mar 11, 2025 at 06:23:15PM -0700, Noah Misch wrote:

On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:

On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:

Thanks for crafting back-branch versions. I've queued a task to confirm I get
the same result.

Thanks for that. That helps a lot.

I'll let you know when I get there.

Your back-patches are correct. Thanks.

There's a test case I'll polish, too.

Are you considering the addition of a TAP test in 17~ based on a wait
injection point in the checkpointer coupled with a check of the server
logs to see if we see the error patterns you've spotted?

No, nothing involving injection points or otherwise version-specific.

Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
Hence, it couldn't be expected to catch the regression before commit, but it
would have made sufficient buildfarm and CI noise in the day after commit.

Attachments:

test-archive-restartpoint-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Test restartpoints in archive recovery.
    
    v14 commit 1f95181b44c843729caaa688f74babe9403b5850 and its v13 equivalent
    caused timing-dependent failures in archive recovery, at restartpoints.  The
    symptom was "invalid magic number 0000 in log segment X, offset 0",
    "unexpected pageaddr X in log segment Y, offset 0" [X < Y], or an assertion
    failure.  Commit FIXME and predecessors back-patched v15 changes to fix that.
    This test reproduces the problem probabilistically, typically in less than
    1000 iterations of the test.  Hence, buildfarm and CI runs would have surfaced
    enough failures to get attention within a day.  Back-patch to v13 (all
    supported versions).
    
    Reported by Arun Thirupathi.  Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde..cb98376 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
+      't/045_archive_restartpoint.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/045_archive_restartpoint.pl b/src/test/recovery/t/045_archive_restartpoint.pl
new file mode 100644
index 0000000..b143bc4
--- /dev/null
+++ b/src/test/recovery/t/045_archive_restartpoint.pl
@@ -0,0 +1,57 @@
+
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+# Test restartpoints during archive recovery.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $archive_max_mb = 320;
+my $wal_segsize = 1;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+	has_archiving => 1,
+	allows_streaming => 1,
+	extra => [ '--wal-segsize' => $wal_segsize ]);
+$node_primary->start;
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+$node_primary->safe_psql('postgres',
+	('DO $$BEGIN FOR i IN 1..' . $archive_max_mb / $wal_segsize)
+	  . ' LOOP CHECKPOINT; PERFORM pg_switch_wal(); END LOOP; END$$;');
+
+# Force archiving of WAL file containing recovery target
+my $until_lsn = $node_primary->lsn('write');
+$node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
+$node_primary->stop;
+
+# Archive recovery
+my $node_restore = PostgreSQL::Test::Cluster->new('restore');
+$node_restore->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+$node_restore->append_conf('postgresql.conf',
+	"recovery_target_lsn = '$until_lsn'");
+$node_restore->append_conf('postgresql.conf',
+	'recovery_target_action = pause');
+$node_restore->append_conf('postgresql.conf',
+	'max_wal_size = ' . 2 * $wal_segsize);
+$node_restore->append_conf('postgresql.conf', 'log_checkpoints = on');
+
+$node_restore->start;
+
+# Wait until restore has replayed enough data
+my $caughtup_query =
+  "SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+$node_restore->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for restore to catch up";
+
+$node_restore->stop;
+ok(1, 'restore caught up');
+
+done_testing();
#47Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#46)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:

Your back-patches are correct. Thanks.

Thanks for double-checking. I'll move on with what I have after a
second look as it's been a few weeks since I've looked at all these
conflicts. I am also planning to add a few notes in the commits to
mention this thread.

Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
Hence, it couldn't be expected to catch the regression before commit, but it
would have made sufficient buildfarm and CI noise in the day after commit.

Hmm. Not much of a fan of the addition of a test that has less than
1% of reproducibility for the problem, even if it's good to see that
this can be made portable to run down to v13.
--
Michael

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#47)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

Michael Paquier <michael@paquier.xyz> writes:

On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:

Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
Hence, it couldn't be expected to catch the regression before commit, but it
would have made sufficient buildfarm and CI noise in the day after commit.

Hmm. Not much of a fan of the addition of a test that has less than
1% of reproducibility for the problem, even if it's good to see that
this can be made portable to run down to v13.

Yeah, it's good to have a test but I doubt we should commit it.
Too many buildfarm cycles will be expended for too little result.

regards, tom lane

#49Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#48)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sat, Apr 05, 2025 at 11:07:13AM -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:

Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
Hence, it couldn't be expected to catch the regression before commit, but it
would have made sufficient buildfarm and CI noise in the day after commit.

Hmm. Not much of a fan of the addition of a test that has less than
1% of reproducibility for the problem, even if it's good to see that
this can be made portable to run down to v13.

Yeah, it's good to have a test but I doubt we should commit it.
Too many buildfarm cycles will be expended for too little result.

Current extent of our archive recovery restartpoint test coverage:

$ grep -c 'restartpoint starting' $(grep -rl 'restored log file' **/log) | grep -v :0
src/bin/pg_combinebackup/tmp_check/log/002_compare_backups_pitr1.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby2.log:1
src/test/recovery/tmp_check/log/002_archiving_standby.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby.log:1
src/test/recovery/tmp_check/log/035_standby_logical_decoding_standby.log:2

Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
that's not enough. If the proposed 3-second test is the wrong thing, what
instead?

#50Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#49)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:

Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
that's not enough. If the proposed 3-second test is the wrong thing, what
instead?

I don't have a good idea about that in ~16, TBH, but I am sure to not
be a fan of the low reproducibility rate of this test as proposed.
It's not perfect, but as the design to fix the original race condition
has been introduced in v15, why not begin with a test in 17~ using
some injection points? This should be good enough while having a good
reproduction rate as the order of the actions in the restart points
would be controlled.
--
Michael

#51Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#50)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:

On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:

Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
that's not enough. If the proposed 3-second test is the wrong thing, what
instead?

I don't have a good idea about that in ~16, TBH, but I am sure to not
be a fan of the low reproducibility rate of this test as proposed.
It's not perfect, but as the design to fix the original race condition
has been introduced in v15, why not begin with a test in 17~ using
some injection points?

Two reasons:

a) The fix ended calls to the whole range of relevant code. Hence, the
injection point placement that would have been relevant before the fix
isn't reached. In other words, there's no right place for the injection
point. (The place for the injection point would be in durable_rename(), in
the checkpointer. After the fix, the checkpointer just doesn't call
durable_rename().)

b) Stochastic tests catch defects beyond the specific one the test author
targeted. An injection point test is less likely to do that. (That said,
with reason (a), there's no known injection point test design to compete
with the stochastic design.)

#52Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#47)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sat, Apr 05, 2025 at 07:14:27PM +0900, Michael Paquier wrote:

On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:

Your back-patches are correct. Thanks.

Thanks for double-checking. I'll move on with what I have after a
second look as it's been a few weeks since I've looked at all these
conflicts. I am also planning to add a few notes in the commits to
mention this thread.

And applies this series on REL_14_STABLE and REL_13_STABLE, editing
all the commits to mention what they were linked to previously.
--
Michael

#53Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#51)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sat, Apr 05, 2025 at 07:09:58PM -0700, Noah Misch wrote:

On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:

On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:

Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
that's not enough. If the proposed 3-second test is the wrong thing, what
instead?

I don't have a good idea about that in ~16, TBH, but I am sure to not
be a fan of the low reproducibility rate of this test as proposed.
It's not perfect, but as the design to fix the original race condition
has been introduced in v15, why not begin with a test in 17~ using
some injection points?

Two reasons:

a) The fix ended calls to the whole range of relevant code. Hence, the
injection point placement that would have been relevant before the fix
isn't reached. In other words, there's no right place for the injection
point. (The place for the injection point would be in durable_rename(), in
the checkpointer. After the fix, the checkpointer just doesn't call
durable_rename().)

b) Stochastic tests catch defects beyond the specific one the test author
targeted. An injection point test is less likely to do that. (That said,
with reason (a), there's no known injection point test design to compete
with the stochastic design.)

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

As another data point, raising the runtime from 3s to 17s makes it reproduce
the problem 25% of the time. You can imagine a plot with axes of runtime and
percent detection. One can pick any point on that plot's curve. Given how
little wall time it takes for the buildfarm and CI to reach a few hundred
runs, I like the trade-off of 3s runtime and 1% detection. In particular, I
like it better than 17s runtime for 25% detection. How do you see it?

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#53)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

Noah Misch <noah@leadboat.com> writes:

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

While I still don't love it, I don't have a better proposal.

regards, tom lane

#55Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#54)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

While I still don't love it, I don't have a better proposal.

No objections from here to use your proposal for the time being on all
the branches. I am planning to spend some cycles thinking about a
better alternative, but I doubt that this will be portable down to
v13.
--
Michael

#56Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#55)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:

On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG: received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL: archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL: The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20% done
(based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT. I can
reproduce this if I test with valgrind --trace-children=yes. With my normal
valgrind settings, the whole test file takes only 18s. I recommend one of
these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
"sh" and "cp" commands.
- Remove --trace-children=yes

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving catches
up. However, that would take skink about 30min, and I expect little value
from 30min of valgrind instrumenting the "cp" command.

#57Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#56)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Sun, Apr 20, 2025 at 02:53:39PM -0700, Noah Misch wrote:

On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:

On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG: received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL: archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL: The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20% done
(based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT. I can
reproduce this if I test with valgrind --trace-children=yes. With my normal
valgrind settings, the whole test file takes only 18s. I recommend one of
these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
"sh" and "cp" commands.
- Remove --trace-children=yes

I gave that more thought. One can be more surgical than that, via
--trace-children-skip-by-arg='*cp "*' or similar. My previous message's two
options stop valgrind instrumentation at boundaries like pg_dumpall calling
system(pg_dump ...), since that execs /bin/sh to run pg_dump. If we wanted to
make it even more explicit and surgical, skink could use
--trace-children-skip-by-arg='*valgrind-ignore-child*' combined with:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1404 +1404 @@ sub enable_restoring
-	  : qq{cp "$path/%f" "%p"};
+	  : qq{cp "$path/%f" "%p"  # valgrind-ignore-child};
@@ -1474 +1474 @@ sub enable_archiving
-	  : qq{cp "%p" "$path/%f"};
+	  : qq{cp "%p" "$path/%f"  # valgrind-ignore-child};

What's your preference?

Show quoted text

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving catches
up. However, that would take skink about 30min, and I expect little value
from 30min of valgrind instrumenting the "cp" command.

#58Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#56)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

Hi,

On 2025-04-20 14:53:39 -0700, Noah Misch wrote:

On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:

On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

Tom and Michael, do you still object to the test addition, or not? If there
are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG: received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL: archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL: The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20% done
(based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.

Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
example for why the single-threaded archiving approach sucks so badly :)

I can reproduce this if I test with valgrind --trace-children=yes. With my
normal valgrind settings, the whole test file takes only 18s. I recommend
one of these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
"sh" and "cp" commands.
- Remove --trace-children=yes

Hm. I think I used --trace-children=yes because I was thinking it was required
to track forks. But a newer version of valgrind's man page has an important
clarification:
--trace-children=<yes|no> [default: no]
When enabled, Valgrind will trace into sub-processes initiated via the exec system call. This is necessary for multi-process programs.

Note that Valgrind does trace into the child of a fork (it would be difficult not to, since fork makes an identical copy of a process), so this
option is arguably badly named. However, most children of fork calls immediately call exec anyway.

So there doesn't seem to be much point in using --trace-children=yes.

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving catches
up. However, that would take skink about 30min, and I expect little value
from 30min of valgrind instrumenting the "cp" command.

I just changed the config to --trace-children=no. There already is a valgrind
run in progress, so it won't be in effect for the next run.

Greetings,

Andres Freund

#59Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#58)
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

On Fri, Apr 25, 2025 at 03:35:06PM -0400, Andres Freund wrote:

On 2025-04-20 14:53:39 -0700, Noah Misch wrote:

The checkpoints and WAL creation took 30s, but archiving was only 20% done
(based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.

Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
example for why the single-threaded archiving approach sucks so badly :)

Yes! I also didn't expect that v14/v13 would run much faster.

I just changed the config to --trace-children=no. There already is a valgrind
run in progress, so it won't be in effect for the next run.

Works for me. I see that resolved things.