odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Hi,
My buildfarm animal grassquit just showed an odd failure [1] in REL_11_STABLE:
ok 10 - standby is in recovery
# Running: pg_ctl -D /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata promote
waiting for server to promote....pg_ctl: control file appears to be corrupt
not ok 11 - pg_ctl promote of standby runs
# Failed test 'pg_ctl promote of standby runs'
# at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.
I didn't find other references to this kind of failure, nor has the error
recurred on grassquit.
I don't immediately see a way for this message to be hit that's not indicating
a bug somewhere. We should be updating the control file in an atomic way and
reading it in an atomic way.
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
Greetings,
Andres Freund
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-11-22%2016%3A33%3A57
On Tue, Nov 22, 2022 at 05:42:24PM -0800, Andres Freund wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
That's the one running under -fsanitize=address. It really smells to me like
a bug, with a race condition written all over it.
--
Michael
On 2022-Nov-22, Andres Freund wrote:
ok 10 - standby is in recovery
# Running: pg_ctl -D /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata promote
waiting for server to promote....pg_ctl: control file appears to be corrupt
not ok 11 - pg_ctl promote of standby runs
# Failed test 'pg_ctl promote of standby runs'
# at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.
This triggered me on this proposal I saw yesterday
/messages/by-id/02fe0063-bf77-90d0-3cf5-e9fe7c2a487b@postgrespro.ru
I think trying to store more stuff in pg_control is dangerous and we
should resist it.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
I assume this is ext4. Presumably anything that reads the
control file without interlocking against writes, like pg_ctl,
pg_checksums, pg_resetwal, pg_control_system(), ..., could see
garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ... by reading without interlocking against
writes could see garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
As for what to do about it, some ideas:
1. Use advisory range locking. (This would be an advisory version of
what many other filesystems do automatically, AFAIK. Does Windows
have a thing like POSIX file locking, or need it here?)
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
3. Some scheme that involves renaming the file into place. (That
might be a pain on Windows; it only works for the relmap thing because
all readers and writers are in the backend and use an LWLock to avoid
silly handle semantics.)
4. ???
First thought is that 2 is an appropriate level of complexity for this
rare and stupid problem.
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ... by reading without interlocking against
writes could see garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
Ugh.
As for what to do about it, some ideas:
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
First thought is that 2 is an appropriate level of complexity for this
rare and stupid problem.
Yeah, I was thinking the same. A variant could be "repeat until
we see the same calculated checksum twice".
regards, tom lane
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
As for what to do about it, some ideas:
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
First thought is that 2 is an appropriate level of complexity for this
rare and stupid problem.
Yeah, I was thinking the same. A variant could be "repeat until
we see the same calculated checksum twice".
Hmm. While writing a comment to explain why that's good enough, I
realised it's not really true for a standby that control file writes
are always expected to be far apart in time. XLogFlush->
UpdateMinRecoveryPoint() could coincide badly with our N attempts for
any small N and for any nap time, which I think makes your idea better
than mine.
With some cartoon-level understanding of what's going on (to wit: I
think the kernel just pins the page but doesn't use a page-level
content lock or range lock, so what you're seeing is raw racing memcpy
calls and unsynchronised cache line effects), I guess you'd be fairly
likely to make "progress" in seeing more new data even if you didn't
sleep in between, but who knows. So I have a 10ms sleep to make
progress very likely; given your algorithm it doesn't matter if you
didn't make all the progress, just some. Since this is reachable from
SQL, I think we also need a CFI call so you can't get uninterruptibly
stuck here?
I wrote a stupid throw-away function to force a write. If you have an
ext4 system to hand (xfs, zfs, apfs, ufs, others don't suffer from
this) you can do:
do $$ begin for i in 1..100000000 loop perform
pg_update_control_file(); end loop; end; $$;
... while you also do:
select pg_control_system();
\watch 0.001
... and you'll soon see:
ERROR: calculated CRC checksum does not match value stored in file
The attached draft patch fixes it.
Attachments:
0001-XXX-Dirty-hack-to-clobber-control-file-for-testing.patch (text/x-patch)
From 9f1856aeb56be888123702fe471d8388da66439f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 24 Nov 2022 00:55:03 +0000
Subject: [PATCH 1/2] XXX Dirty hack to clobber control file for testing
---
src/backend/access/transam/xlog.c | 10 ++++++++++
src/include/catalog/pg_proc.dat | 8 ++++++++
2 files changed, 18 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..88de05ab35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2502,6 +2502,16 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
LWLockRelease(ControlFileLock);
}
+Datum
+pg_update_control_file()
+{
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->minRecoveryPoint++; /* XXX change something to affect CRC! */
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ PG_RETURN_VOID();
+}
+
/*
* Ensure that all XLOG data through the given position is flushed to disk.
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f15aa2dbb1..8177c1657c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11679,6 +11679,14 @@
proargnames => '{pg_control_version,catalog_version_no,system_identifier,pg_control_last_modified}',
prosrc => 'pg_control_system' },
+{ oid => '8888',
+ descr => 'update control file',
+ proname => 'pg_update_control_file', provolatile => 'v', prorettype => 'void',
+ proargtypes => '', proallargtypes => '',
+ proargmodes => '{}',
+ proargnames => '{}',
+ prosrc => 'pg_update_control_file' },
+
{ oid => '3442',
descr => 'pg_controldata checkpoint state information as a function',
proname => 'pg_control_checkpoint', provolatile => 'v',
--
2.35.1
0002-Try-to-tolerate-concurrent-reads-and-writes-of-contr.patch (text/x-patch)
From a63818a32d661dba563cedfdb85731e522b3c6a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH 2/2] Try to tolerate concurrent reads and writes of control
file.
Various frontend programs and SQL-callable backend functions read the
control file without any kind of interlocking against concurrent writes.
Linux ext4 doesn't implement the atomicity required by POSIX here, so a
concurrent reader can see only partial effects of an in-progress write.
Tolerate this by retrying until we get two reads in a row with the same
checksum, after an idea from Tom Lane.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/common/controldata_utils.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 2d1f35bbd1..200d24df02 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,19 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
char ControlFilePath[MAXPGPATH];
pg_crc32c crc;
int r;
+ bool first_try;
+ pg_crc32c last_crc;
Assert(crc_ok_p);
ControlFile = palloc_object(ControlFileData);
snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
+ first_try = true;
+ INIT_CRC32C(last_crc);
+
+retry:
+
#ifndef FRONTEND
if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
ereport(ERROR,
@@ -117,6 +124,24 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
+ /*
+ * With unlucky timing on filesystems that don't implement atomicity of
+ * concurrent reads and writes (such as Linux ext4), we might have seen
+ * garbage if the server was writing to the file at the same time. Keep
+ * retrying until we see the same CRC twice.
+ */
+ if (!*crc_ok_p && (first_try || !EQ_CRC32C(crc, last_crc)))
+ {
+ first_try = false;
+ last_crc = crc;
+ pg_usleep(10000);
+
+#ifndef FRONTEND
+ CHECK_FOR_INTERRUPTS();
+#endif
+ goto retry;
+ }
+
/* Make sure the control file is valid byte order. */
if (ControlFile->pg_control_version % 65536 == 0 &&
ControlFile->pg_control_version / 65536 != 0)
--
2.35.1
On Thu, Nov 24, 2022 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
... and you'll soon see:
ERROR: calculated CRC checksum does not match value stored in file
I forgot to mention: this reproducer only seems to work if fsync =
off. I don't know why, but I recall that was true also for bug
#17064.
Hello!
On 24.11.2022 04:02, Thomas Munro wrote:
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
ERROR: calculated CRC checksum does not match value stored in file
The attached draft patch fixes it.
Tried to catch this error on my PC, but failed to do it within a reasonable time.
The 1s interval is probably too long for me.
It seems there is a more reliable way to reproduce this bug with the 0001 patch applied:
At the first backend:
do $$ begin loop perform pg_update_control_file(); end loop; end; $$;
At the second one:
do $$ begin loop perform pg_control_system(); end loop; end; $$;
It fails almost immediately with:
"ERROR: calculated CRC checksum does not match value stored in file"
both with fsync = off and fsync = on.
Checked it out for master and REL_11_STABLE.
Also checked for a few hours that the patch 0002 fixes this error,
but there are some questions about its logical structure.
The equality between the previously calculated crc and the newly calculated one is checked only
if the last crc calculation was wrong, i.e. not equal to the value stored in the file.
It is very unlikely that in this case the previous and new crc can match, so, in fact,
the loop will spin until the crc is calculated correctly. In other words,
this algorithm is practically equivalent to an infinite loop of reading from the file
and calculating the crc while (EQ_CRC32C(crc, ControlFile->crc) != true).
But then it can be simplified significantly by removing the checksum equality checks
and the bool first_try, and by limiting the maximum number of iterations
with some constant, e.g. in a for loop, to avoid a theoretically possible freeze.
Or maybe use the variant suggested by Tom Lane, i.e., as far as I understand,
repeat the file_open-read-close-calculate_crc sequence twice without a pause between
them and check both calculated crcs for equality. If they match, exit and return
the bool result of comparing the last calculation with the value from the file;
if not, take a pause and repeat everything from the beginning.
In this case, there is no need to check *crc_ok_p inside get_controlfile()
as in the present version; I think it's more logically correct,
since this variable is intended for top-level functions and the logic
inside get_controlfile() should not depend on its value.
Also found a warning in 0001 patch for master branch. On my PC gcc gives:
xlog.c:2507:1: warning: no previous prototype for ‘pg_update_control_file’ [-Wmissing-prototypes]
2507 | pg_update_control_file()
Fixed it by adding #include "utils/fmgrprotos.h" to xlog.c and
PG_FUNCTION_ARGS to pg_update_control_file().
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
Also checked for a few hours that the patch 0002 fixes this error,
but there are some questions to its logical structure.
Hi Anton,
Thanks for looking!
The equality between the previous and newly calculated crc is checked only
if the last crc calculation was wrong, i.e not equal to the value stored in the file.
It is very unlikely that in this case the previous and new crc can match, so, in fact,
the loop will spin until the crc is calculated correctly. In other words,
this algorithm is practically equivalent to an infinite loop of reading from a file
and calculating crc while(EQ_CRC32C(crc, ControlFile->crc) != true).
Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.
But then it can be simplified significantly by removing checksums equality checks,
bool fist_try and by limiting the maximum number of iterations
with some constant in the e.g. for loop to avoid theoretically possible freeze.
Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.
Primary servers write the control file very infrequently. Standbys
do so more frequently, while writing data out, maybe every few seconds on a
busy system writing out a lot of data. UpdateMinRecoveryPoint() makes
some effort to avoid updating the file too often. You definitely see
bursts of repeated flushes that might send this thing in a loop for a
while if the timings were exactly wrong, but usually with periodic
gaps; I haven't really studied the expected behaviour too closely.
Or maybe use the variant suggested by Tom Lane, i.e, as far as i understand,
repeat the file_open-read-close-calculate_crc sequence twice without a pause between
them and check the both calculated crcs for the equality. If they match, exit and return
the bool result of comparing between the last calculation with the value from the file,
if not, take a pause and repeat everything from the beginning.
Hmm. Would it be good enough to do two read() calls with no sleep in
between? How sure are we that a concurrent write will manage to
change at least one bit that our second read can see? I guess it's
likely, but maybe hypervisors, preemptible kernels, I/O interrupts or
a glitch in the matrix could decrease our luck? I really don't know.
So I figured I should add a small sleep between the reads to change
the odds in our favour. But we don't want to slow down all reads of
the control file with a sleep, do we? So I thought we should only
bother doing this slow stuff if the actual CRC check fails, a low
probability event.
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
Attachments:
0001-Lock-pg_control-while-reading-or-writing.patch (text/x-patch)
From fe71350b0874adbec97c612e7b80e7179c6c48e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH 1/2] Lock pg_control while reading or writing.
Front-end programs that read pg_control and user-facing functions run
in the backend read pg_control without acquiring ControlFileLock. If
you're unlucky enough to read() while the backend is in write(), on at
least Linux ext4 you might see partial data. Use a POSIX advisory lock
to avoid this problem.
---
src/common/controldata_utils.c | 69 ++++++++++++++++++++++++++++++++++
1 file changed, 69 insertions(+)
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..284895bdd9 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -39,6 +39,42 @@
#include "storage/fd.h"
#endif
+/*
+ * Advisory-lock the control file until closed.
+ */
+static int
+lock_file(int fd, bool exclusive)
+{
+#ifdef WIN32
+ /*
+ * LockFile() might work, but it seems dangerous because there are reports
+ * that the lock can sometimes linger if a program crashes without
+ * unlocking. That would prevent recovery after a PANIC, so we'd better
+ * not use it without more research. Because of the lack of portability,
+ * this function is kept private to controldata_utils.c.
+ */
+ return 0;
+#else
+ struct flock lock;
+ int rc;
+
+ memset(&lock, 0, sizeof(lock));
+ lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
+ lock.l_start = 0;
+ lock.l_whence = SEEK_SET;
+ lock.l_len = 0;
+ lock.l_pid = -1;
+
+ do
+ {
+ rc = fcntl(fd, F_SETLKW, &lock);
+ }
+ while (rc < 0 && errno == EINTR);
+
+ return rc;
+#endif
+}
+
/*
* get_controlfile()
*
@@ -74,6 +110,20 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
ControlFilePath);
#endif
+ /* Make sure we can read the file atomically. */
+ if (lock_file(fd, false) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for reading: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for reading: %m",
+ ControlFilePath);
+#endif
+ }
+
r = read(fd, ControlFile, sizeof(ControlFileData));
if (r != sizeof(ControlFileData))
{
@@ -186,6 +236,25 @@ update_controlfile(const char *DataDir,
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
+ /*
+ * Make sure that any concurrent reader (including frontend programs) can
+ * read the file atomically. Note that this refers to atomicity of
+ * concurrent reads and writes. For our assumption of atomicity under
+ * power failure, see PG_CONTROL_MAX_SAFE_SIZE.
+ */
+ if (lock_file(fd, true) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for writing: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for writing: %m",
+ ControlFilePath);
+#endif
+ }
+
errno = 0;
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
--
2.35.1
Hi, Thomas!
There are two variants of the patch now.
1) As for the first workaround:
On 31.01.2023 07:09, Thomas Munro wrote:
Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.
........
Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.
At first I didn't understand that the equality condition between the previously
calculated crc and the current one on the second and later attempts was intended
for the case when the pg_control file is really corrupted.
Indeed, by adding a few debugging variables and running the torture test,
I found that there were ~4000 crc errors (~0.0003%) in ~125 million reads from the file,
and there was no case when a crc error appeared twice in a row.
So the second, and moreover the third, successive occurrence of a crc error
can be neglected, and for this workaround a simpler and maybe clearer
algorithm seems possible.
For instance:
for (try = 0; try < 3; try++)
{
    open, read from and close pg_control;
    calculate crc;
    *crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
    if (*crc_ok_p)
        break;
}
2) As for the second variant of the patch with POSIX locks:
On 31.01.2023 14:38, Thomas Munro wrote:
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
It looks very strange to me that there may be cases where the cluster data
is stored on NFS. I can't figure out how this can be useful.
I think this variant of the patch is a normal solution
to the problem, not a workaround. Found no problems on Linux.
+1 for this variant.
Might we add a custom error message for EDEADLK,
since it is absent from errcode_for_file_access()?
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a crc error
occurs there, then we might use the workaround from the first
variant of the patch.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
Maybe choose it in accordance with the wal_sync_method GUC?
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
On 31.01.2023 14:38, Thomas Munro wrote:
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
It looks very strange to me that there may be cases where the cluster data
is stored in NFS. Can't figure out how this can be useful.
Heh. There are many interesting failure modes, but people do it. I
guess my more general question when introducing any new system call
into this code is how some unusual system I can't even access is going
to break. Maybe some obscure filesystem will fail with EOPNOTSUPP, or
take 5 seconds and then fail because there is no lock server
configured or whatever, so that's why I don't think we can back-patch
it, and we probably need a way to turn it off.
i think this variant of the patch is a normal solution
of the problem, not workaround. Found no problems on Linux.
+1 for this variant.
I prefer it too.
Might add a custom error message for EDEADLK
since it absent in errcode_for_file_access()?
Ah, good thought. I think it shouldn't happen™, so it's OK that
errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.
Other interesting errors are:
ENOLCK: system limits exceeded; PANIC seems reasonable
EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
pages, not on POSIX)
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately i've not had time to reproduce the problem and apply this patch on
Windows yet but i'm going to do it soon on windows 10. If a crc error
will occur there, then we might use the workaround from the first
variant of the patch.
Thank you for investigating. I am afraid to read your results.
One idea would be to proceed with LockFile() for Windows, with a note
suggesting you file a bug with your OS vendor if you ever need it to
get unstuck. Googling this subject, I read that MongoDB used to
suffer from stuck lock files, until an OS bug report led to recent
versions releasing locks more promptly. I find that sufficiently
scary that I would want to default the feature to off on Windows, even
if your testing shows that it does really need it.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
May be choose it in accordance with GUC wal_sync_method?
Here's a patch like that. I don't know if it's a good idea for
wal_sync_method to affect other kinds of files or not, but, then, it
already does (fsync_writethough changes this behaviour). I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
Attachments:
0001-Apply-wal_sync_method-to-pg_control-file.patch (text/x-patch)
From f55c98daefc4eec7f05413edf92b3b1e7070e1ef Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 21:08:12 +1300
Subject: [PATCH] Apply wal_sync_method to pg_control file.
When writing out the control file frequently on a standby, we can go a
little faster on some systems by using fdatasync() instead of fsync(),
with no loss of safety. We already respected the special
"wal_sync_method=fsync_writethrough" via a circuitous route, but we
might as well support the full range of methods.
---
src/backend/access/transam/xlog.c | 19 ++++++++-
src/common/controldata_utils.c | 59 ++++++++++++++++++++------
src/include/common/controldata_utils.h | 7 ++-
3 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..d0a7a3d7eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4165,7 +4165,24 @@ ReadControlFile(void)
static void
UpdateControlFile(void)
{
- update_controlfile(DataDir, ControlFile, true);
+ int sync_op;
+
+ switch (sync_method)
+ {
+ case SYNC_METHOD_FDATASYNC:
+ sync_op = UPDATE_CONTROLFILE_FDATASYNC;
+ break;
+ case SYNC_METHOD_OPEN:
+ sync_op = UPDATE_CONTROLFILE_O_SYNC;
+ break;
+ case SYNC_METHOD_OPEN_DSYNC:
+ sync_op = UPDATE_CONTROLFILE_O_DSYNC;
+ break;
+ default:
+ sync_op = UPDATE_CONTROLFILE_FSYNC;
+ }
+
+ update_controlfile(DataDir, ControlFile, sync_op);
}
/*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..1bc1c6f8d4 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -136,18 +136,39 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
* update_controlfile()
*
* Update controlfile values with the contents given by caller. The
- * contents to write are included in "ControlFile". "do_sync" can be
- * optionally used to flush the updated control file. Note that it is up
- * to the caller to properly lock ControlFileLock when calling this
- * routine in the backend.
+ * contents to write are included in "ControlFile". "sync_op" can be set
+ * to 0 or a synchronization method UPDATE_CONTROLFILE_XXX. In frontend code,
+ * only 0 and non-zero (fsync) are meaningful.
+ * Note that it is up to the caller to properly lock ControlFileLock when
+ * calling this routine in the backend.
*/
void
update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync)
+ ControlFileData *ControlFile,
+ int sync_op)
{
int fd;
char buffer[PG_CONTROL_FILE_SIZE];
char ControlFilePath[MAXPGPATH];
+ int open_flag;
+
+ switch (sync_op)
+ {
+#ifndef FRONTEND
+#ifdef O_SYNC
+ case UPDATE_CONTROLFILE_O_SYNC:
+ open_flag = O_SYNC;
+ break;
+#endif
+#ifdef O_DSYNC
+ case UPDATE_CONTROLFILE_O_DSYNC:
+ open_flag = O_DSYNC;
+ break;
+#endif
+#endif
+ default:
+ open_flag = 0;
+ }
/* Update timestamp */
ControlFile->time = (pg_time_t) time(NULL);
@@ -175,13 +196,13 @@ update_controlfile(const char *DataDir,
* All errors issue a PANIC, so no need to use OpenTransientFile() and to
* worry about file descriptor leaks.
*/
- if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+ if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | open_flag)) < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
ControlFilePath)));
#else
- if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY,
+ if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY | open_flag,
pg_file_create_mode)) == -1)
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
@@ -209,17 +230,29 @@ update_controlfile(const char *DataDir,
pgstat_report_wait_end();
#endif
- if (do_sync)
+ if (sync_op != 0)
{
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
- if (pg_fsync(fd) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- ControlFilePath)));
+ if (sync_op == UPDATE_CONTROLFILE_FDATASYNC)
+ {
+ if (fdatasync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fdatasync file \"%s\": %m",
+ ControlFilePath)));
+ }
+ else if (sync_op == UPDATE_CONTROLFILE_FSYNC)
+ {
+ if (pg_fsync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fdatasync file \"%s\": %m",
+ ControlFilePath)));
+ }
pgstat_report_wait_end();
#else
+ /* In frontend code, non-zero sync_op gets you plain fsync(). */
if (fsync(fd) != 0)
pg_fatal("could not fsync file \"%s\": %m", ControlFilePath);
#endif
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..879455c8fb 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -12,8 +12,13 @@
#include "catalog/pg_control.h"
+#define UPDATE_CONTROLFILE_FSYNC 1
+#define UPDATE_CONTROLFILE_FDATASYNC 2
+#define UPDATE_CONTROLFILE_O_SYNC 3
+#define UPDATE_CONTROLFILE_O_DSYNC 4
+
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync);
+ ControlFileData *ControlFile, int sync_op);
#endif /* COMMON_CONTROLDATA_UTILS_H */
--
2.35.1
Hi, Thomas!
Thanks for your rapid answer, and sorry for my delayed reply.
On 01.02.2023 09:45, Thomas Munro wrote:
Might add a custom error message for EDEADLK
since it is absent in errcode_for_file_access()?
Ah, good thought. I think it shouldn't happen™, so it's OK that
errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.
Yes, I also think that it is impossible, since the lock is taken on
the entire file, so ERRCODE_INTERNAL_ERROR is right here.
Other interesting errors are:
ENOLCK: system limits exceeded; PANIC seems reasonable
EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
pages, not on POSIX)
Agreed that ENOLCK warrants a PANIC, or at least FATAL. Maybe it is even better
to make it FATAL, to allow other backends to survive?
As for EOPNOTSUPP, maybe fall back to the workaround from the
first variant of the patch? (In my previous letter I forgot the pause
after the break;, of course.)
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a CRC error
occurs there, then we might use the workaround from the first
variant of the patch.
Thank you for investigating. I am afraid to read your results.
First of all, it seemed to me that this is not a problem at all, since MSDN
guarantees sector-by-sector atomicity.
"Physical Sector: The unit for which read and write operations to the device
are completed in a single operation. This is the unit of atomic write..."
https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
(Of course, only if the 512 bytes lie at the beginning of the file at zero
offset, but this is our case. The current size of ControlFileData is
296 bytes at offset = 0.)
I tried to verify this fact experimentally and was very surprised.
Unfortunately it crashed in an hour during a torture test:
2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections
@@@@@@ sizeof(ControlFileData) = 296
.........
2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file
But fortunately, this only happens when fsync=off.
Also, I did several experiments with fsync=on and found more appropriate behavior:
the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
but the other one, with the ordinary sizeof(ControlFileData) = 296, did not crash in more than 12 hours.
It seems that in that case the behavior corresponds to MSDN. So if it is possible
to use fsync() under Windows when the GUC fsync is off, that may be a solution
for this problem. If so, there is no need to lock the pg_control file under Windows at all.
Maybe choose it in accordance with the GUC wal_sync_method?
Here's a patch like that. I don't know if it's a good idea for
wal_sync_method to affect other kinds of files or not, but, then, it
already does (fsync_writethough changes this behaviour).
+1. Looks like it needs a little fix:
+++ b/src/common/controldata_utils.c
@@ -316,7 +316,7 @@ update_controlfile(const char *DataDir,
if (pg_fsync(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
- errmsg("could not fdatasync file \"%s\": %m",
+ errmsg("could not fsync file \"%s\": %m",
ControlFilePath)));
And it may be combined with 0001-Lock-pg_control-while-reading-or-writing.patch
I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
I didn't quite understand this point. Could you clarify it for me, please? If the performance
of UpdateMinRecoveryPoint() weren't a problem, we could just use fsync on all platforms.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi, Thomas!
On 14.02.2023 06:38, Anton A. Melnikov wrote:
Also, I did several experiments with fsync=on and found more appropriate behavior:
the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
but the other one, with the ordinary sizeof(ControlFileData) = 296, did not crash in more than 12 hours.
Nonetheless it crashed after 18 hours:
2023-02-13 18:07:21.476 MSK [7640] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 18:07:21.483 MSK [7640] LOG: listening on IPv6 address "::1", port 5432
2023-02-13 18:07:21.483 MSK [7640] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-02-13 18:07:21.556 MSK [1940] LOG: database system was shut down at 2023-02-13 18:07:12 MSK
2023-02-13 18:07:21.590 MSK [7640] LOG: database system is ready to accept connections
@@@@@@@@@@@@@ sizeof(ControlFileData) = 296
2023-02-13 18:12:21.545 MSK [9532] LOG: checkpoint starting: time
2023-02-13 18:12:21.583 MSK [9532] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.009 s, total=0.038 s; sync files=2, longest=0.005 s, average=0.005 s; distance=0 kB, estimate=0 kB; lsn=0/17AC388, redo lsn=0/17AC350
2023-02-14 12:12:21.738 MSK [8676] ERROR: calculated CRC checksum does not match value stored in file
2023-02-14 12:12:21.738 MSK [8676] CONTEXT: SQL statement "SELECT pg_control_system()"
PL/pgSQL function inline_code_block line 1 at PERFORM
2023-02-14 12:12:21.738 MSK [8676] STATEMENT: do $$ begin loop perform pg_control_system(); end loop; end; $$;
So all of the following is incorrect:
It seems that in that case the behavior corresponds to MSDN. So if it is possible
to use fsync() under Windows when the GUC fsync is off, that may be a solution
for this problem. If so, there is no need to lock the pg_control file under Windows at all.
and cannot be a solution.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
First of all, it seemed to me that this is not a problem at all, since MSDN
guarantees sector-by-sector atomicity.
"Physical Sector: The unit for which read and write operations to the device
are completed in a single operation. This is the unit of atomic write..."
https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
(Of course, only if the 512 bytes lie at the beginning of the file at zero
offset, but this is our case. The current size of ControlFileData is
296 bytes at offset = 0.)
There are two kinds of atomicity that we rely on for the control file today:
* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)
I assume that documentation is talking about the first thing (BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down [1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).
With this patch we would stop relying on the second thing. Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking). Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
I tried to verify this fact experimentally and was very surprised.
Unfortunately it crashed in an hour during a torture test:
2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections
@@@@@@ sizeof(ControlFileData) = 296
.........
2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file
Thanks. I'd seen reports of this in discussions on the 'net, but
those had no authoritative references or supporting test results. The
fact that fsync made it take longer (in your following email) makes
sense and matches Linux. It inserts a massive pause in the middle of
the experiment loop, affecting the probabilities.
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
I didn't quite understand this point. Could you clarify it for me, please? If the performance
of UpdateMinRecoveryPoint() weren't a problem, we could just use fsync on all platforms.
The level of durability would be weakened on Windows. On all systems
except Linux and FreeBSD, we default to wal_sync_method=open_datasync,
so then we would start using FILE_FLAG_WRITE_THROUGH for the control
file too. We know from reading and experimentation that
FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer
Windows hardware, but its fdatasync-like thing does [2]. I have not
thought too hard about the consequences of using different durability
levels for different categories of file, but messing with write
ordering can't be good for crash recovery, so I'd rather increase WAL
durability than decrease control file durability. If a Windows user
complains that it makes their fancy non-volatile cache slow down, they
can always adjust the settings in PostgreSQL, their OS, or their
drivers etc. I think we should just make fdatasync the default on all
systems.
Here's a patch like that.
[1]: https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access
[2]: /messages/by-id/CA+hUKG+a-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw@mail.gmail.com
Attachments:
0001-Lock-pg_control-while-reading-or-writing.patch (text/x-patch)
From 816760e52a1233b29674869205ce9283aab28a73 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH 1/3] Lock pg_control while reading or writing.
Front-end programs that read pg_control and user-facing functions run
in the backend read pg_control without acquiring ControlFileLock. If
you're unlucky enough to read() while the backend is in write(), on at
least ext4 or NTFS, you might see partial data. Use a POSIX advisory
file lock or a Windows mandatory file lock, to avoid this problem.
Since the semantics of POSIX and Windows file locks are different
and error-prone, don't try to provide a general purpose reusable
function for those operations; keep them close to the control file code.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/access/transam/xlog.c | 21 +++++
src/common/controldata_utils.c | 126 +++++++++++++++++++++++++
src/include/common/controldata_utils.h | 4 +
3 files changed, 151 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..9b9c3294e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3980,6 +3980,19 @@ ReadControlFile(void)
errmsg("could not open file \"%s\": %m",
XLOG_CONTROL_FILE)));
+ /*
+ * Since file locks are released asynchronously after a program crashes on
+ * Windows, there is a small chance that the write lock of a previously
+ * running postmaster hasn't been released yet. We need to wait for that
+ * here; if we didn't, our read might fail, because Windows file locks are
+ * mandatory (that is, not advisory).
+ */
+ if (lock_controlfile(fd, false) < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\": %m",
+ XLOG_CONTROL_FILE)));
+
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_READ);
r = read(fd, ControlFile, sizeof(ControlFileData));
if (r != sizeof(ControlFileData))
@@ -3997,6 +4010,14 @@ ReadControlFile(void)
}
pgstat_report_wait_end();
+#ifdef WIN32
+ if (unlock_controlfile(fd) < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not unlock file \"%s\": %m",
+ XLOG_CONTROL_FILE)));
+#endif
+
close(fd);
/*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..45b43ae546 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -39,6 +39,91 @@
#include "storage/fd.h"
#endif
+/*
+ * Lock the control file until closed. This should only be used on files where
+ * only one descriptor is open in a given process (due to weird semantics of
+ * POSIX locks). On Windows, see unlock_controlfile(), but otherwise the lock
+ * is held until the file is closed.
+ */
+int
+lock_controlfile(int fd, bool exclusive)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ /* Lock first byte. */
+ if (!LockFileEx(handle,
+ exclusive ? LOCKFILE_EXCLUSIVE_LOCK : 0,
+ 0,
+ 1,
+ 0,
+ &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return 0;
+#else
+ struct flock lock;
+ int rc;
+
+ memset(&lock, 0, sizeof(lock));
+ lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
+ lock.l_start = 0;
+ lock.l_whence = SEEK_SET;
+ lock.l_len = 0;
+ lock.l_pid = -1;
+
+ do
+ {
+ rc = fcntl(fd, F_SETLKW, &lock);
+ }
+ while (rc < 0 && errno == EINTR);
+
+ return rc;
+#endif
+}
+
+#ifdef WIN32
+/*
+/* Release a lock acquired with lock_controlfile(). On POSIX systems, we don't
+ * bother making an extra system call to release the lock, since it'll be
+ * released on close anyway. On Windows, explicit release is recommended by
+ * the documentation to make sure it is done synchronously.
+ */
+int
+unlock_controlfile(int fd)
+{
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ /* Unlock first byte. */
+ if (!UnlockFileEx(handle, 0, 1, 0, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return 0;
+}
+#endif
+
/*
* get_controlfile()
*
@@ -74,6 +159,20 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
ControlFilePath);
#endif
+ /* Make sure we can read the file atomically. */
+ if (lock_controlfile(fd, false) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for reading: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for reading: %m",
+ ControlFilePath);
+#endif
+ }
+
r = read(fd, ControlFile, sizeof(ControlFileData));
if (r != sizeof(ControlFileData))
{
@@ -97,6 +196,10 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
#endif
}
+#ifdef WIN32
+ unlock_controlfile(fd);
+#endif
+
#ifndef FRONTEND
if (CloseTransientFile(fd) != 0)
ereport(ERROR,
@@ -186,6 +289,25 @@ update_controlfile(const char *DataDir,
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
+ /*
+ * Make sure that any concurrent reader (including frontend programs) can
+ * read the file atomically. Note that this refers to atomicity of
+ * concurrent reads and writes. For our assumption of atomicity under
+ * power failure, see PG_CONTROL_MAX_SAFE_SIZE.
+ */
+ if (lock_controlfile(fd, true) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for writing: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for writing: %m",
+ ControlFilePath);
+#endif
+ }
+
errno = 0;
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
@@ -225,6 +347,10 @@ update_controlfile(const char *DataDir,
#endif
}
+#ifdef WIN32
+ unlock_controlfile(fd);
+#endif
+
if (close(fd) != 0)
{
#ifndef FRONTEND
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..06825cad85 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -15,5 +15,9 @@
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
ControlFileData *ControlFile, bool do_sync);
+extern int lock_controlfile(int fd, bool exclusive);
+#ifdef WIN32
+extern int unlock_controlfile(int fd);
+#endif
#endif /* COMMON_CONTROLDATA_UTILS_H */
--
2.35.1
0002-Make-wal_sync_method-fdatasync-the-default-on-all-OS.patch (text/x-patch)
From fa594006c3ef7e958d870978c631b003aa680c18 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 16 Feb 2023 20:02:22 +1300
Subject: [PATCH 2/3] Make wal_sync_method=fdatasync the default on all OSes.
Previously, fdatasync was the default level on Linux and FreeBSD by
special rules, and on OpenBSD and DragonflyBSD because they didn't have
O_DSYNC != O_SYNC or O_DSYNC at all. For every other system, we'd
choose open_datasync.
Use fdatasync everywhere, for consistency. This became possible after
commit 9430fb40 added support for Windows, the last known system not to
have it. This means that we'll now flush caches on consumer drives by
default on Windows, where previously we didn't, which seems like a
better default for crash safety. Users who want the older behavior can
still request it with wal_sync_method=open_datasync.
---
doc/src/sgml/config.sgml | 5 ++---
doc/src/sgml/wal.sgml | 8 ++++----
src/bin/pg_test_fsync/pg_test_fsync.c | 4 ++--
src/include/access/xlogdefs.h | 14 +-------------
src/include/port/freebsd.h | 7 -------
src/include/port/linux.h | 8 --------
6 files changed, 9 insertions(+), 37 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ecd9aa73ef..b78d374d7e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3107,9 +3107,8 @@ include_dir 'conf.d'
<para>
The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.
Not all of these choices are available on all platforms.
- The default is the first method in the above list that is supported
- by the platform, except that <literal>fdatasync</literal> is the default on
- Linux and FreeBSD. The default is not necessarily ideal; it might be
+ The default is <literal>fdatasync</literal>.
+ The default is not necessarily ideal; it might be
necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or
achieve optimal performance.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ed7929cbcd..ffc22a7ded 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -106,12 +106,12 @@
<listitem>
<para>
On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
- <literal>open_datasync</literal> (the default), write caching can be disabled
+ <literal>open_datasync</literal>, write caching can be disabled
by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
Alternatively, set <varname>wal_sync_method</varname> to
- <literal>fdatasync</literal> (NTFS only), <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, which prevent
- write caching.
+ <literal>fdatasync</literal> (the default), <literal>fsync</literal> or
+ <literal>fsync_writethrough</literal>, which use system calls that
+ flush write caches.
</para>
</listitem>
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 3d5e8f30ab..45bd2daf2e 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -292,7 +292,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
+ printf(_("(fdatasync is the default)\n"));
/*
* Test open_datasync if available
@@ -326,7 +326,7 @@ test_sync(int writes_per_op)
#endif
/*
- * Test fdatasync if available
+ * Test fdatasync
*/
printf(LABEL_FORMAT, "fdatasync");
fflush(stdout);
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index fe794c7740..a628976902 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -64,19 +64,7 @@ typedef uint32 TimeLineID;
*/
typedef uint16 RepOriginId;
-/*
- * This chunk of hackery attempts to determine which file sync methods
- * are available on the current platform, and to choose an appropriate
- * default method.
- *
- * Note that we define our own O_DSYNC on Windows, but not O_SYNC.
- */
-#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
-#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
-#elif defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
-#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
-#else
+/* Default synchronization method for WAL. */
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
-#endif
#endif /* XLOG_DEFS_H */
diff --git a/src/include/port/freebsd.h b/src/include/port/freebsd.h
index 0e3fde55d6..2e36d3da4f 100644
--- a/src/include/port/freebsd.h
+++ b/src/include/port/freebsd.h
@@ -1,8 +1 @@
/* src/include/port/freebsd.h */
-
-/*
- * Set the default wal_sync_method to fdatasync. xlogdefs.h's normal rules
- * would prefer open_datasync on FreeBSD 13+, but that is not a good choice on
- * many systems.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index 7a6e46cdbb..acd867606c 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,11 +12,3 @@
* to have a kernel version test here.
*/
#define HAVE_LINUX_EIDRM_BUG
-
-/*
- * Set the default wal_sync_method to fdatasync. With recent Linux versions,
- * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
- * perform better and (b) causes outright failures on ext4 data=journal
- * filesystems, because those don't support O_DIRECT.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
--
2.35.1
0003-Apply-wal_sync_method-to-pg_control-file.patch (text/x-patch)
From c9dcb77263a7f86b187231a83083bfbe3a0f30b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 21:08:12 +1300
Subject: [PATCH 3/3] Apply wal_sync_method to pg_control file.
When writing out the control file frequently on a standby, we can go a
little faster on some systems by using fdatasync() instead of fsync(),
with no loss of safety. We already respected the special
"wal_sync_method=fsync_writethrough" via a circuitous route, but we
might as well support the full range of methods.
---
src/backend/access/transam/xlog.c | 19 ++++++++-
src/common/controldata_utils.c | 59 ++++++++++++++++++++------
src/include/common/controldata_utils.h | 7 ++-
3 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b9c3294e8..27c1cb8273 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4185,7 +4185,24 @@ ReadControlFile(void)
static void
UpdateControlFile(void)
{
- update_controlfile(DataDir, ControlFile, true);
+ int sync_op;
+
+ switch (sync_method)
+ {
+ case SYNC_METHOD_FDATASYNC:
+ sync_op = UPDATE_CONTROLFILE_FDATASYNC;
+ break;
+ case SYNC_METHOD_OPEN:
+ sync_op = UPDATE_CONTROLFILE_O_SYNC;
+ break;
+ case SYNC_METHOD_OPEN_DSYNC:
+ sync_op = UPDATE_CONTROLFILE_O_DSYNC;
+ break;
+ default:
+ sync_op = UPDATE_CONTROLFILE_FSYNC;
+ }
+
+ update_controlfile(DataDir, ControlFile, sync_op);
}
/*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 45b43ae546..3fc2a00d44 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -239,18 +239,39 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
* update_controlfile()
*
* Update controlfile values with the contents given by caller. The
- * contents to write are included in "ControlFile". "do_sync" can be
- * optionally used to flush the updated control file. Note that it is up
- * to the caller to properly lock ControlFileLock when calling this
- * routine in the backend.
+ * contents to write are included in "ControlFile". "sync_op" can be set
+ * to 0 or a synchronization method UPDATE_CONTROLFILE_XXX. In frontend code,
+ * only 0 and non-zero (fsync) are meaningful.
+ * Note that it is up to the caller to properly lock ControlFileLock when
+ * calling this routine in the backend.
*/
void
update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync)
+ ControlFileData *ControlFile,
+ int sync_op)
{
int fd;
char buffer[PG_CONTROL_FILE_SIZE];
char ControlFilePath[MAXPGPATH];
+ int open_flag;
+
+ switch (sync_op)
+ {
+#ifndef FRONTEND
+#ifdef O_SYNC
+ case UPDATE_CONTROLFILE_O_SYNC:
+ open_flag = O_SYNC;
+ break;
+#endif
+#ifdef O_DSYNC
+ case UPDATE_CONTROLFILE_O_DSYNC:
+ open_flag = O_DSYNC;
+ break;
+#endif
+#endif
+ default:
+ open_flag = 0;
+ }
/* Update timestamp */
ControlFile->time = (pg_time_t) time(NULL);
@@ -278,13 +299,13 @@ update_controlfile(const char *DataDir,
* All errors issue a PANIC, so no need to use OpenTransientFile() and to
* worry about file descriptor leaks.
*/
- if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+ if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | open_flag)) < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
ControlFilePath)));
#else
- if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY,
+ if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY | open_flag,
pg_file_create_mode)) == -1)
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
@@ -331,17 +352,29 @@ update_controlfile(const char *DataDir,
pgstat_report_wait_end();
#endif
- if (do_sync)
+ if (sync_op != 0)
{
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
- if (pg_fsync(fd) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- ControlFilePath)));
+ if (sync_op == UPDATE_CONTROLFILE_FDATASYNC)
+ {
+ if (fdatasync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fdatasync file \"%s\": %m",
+ ControlFilePath)));
+ }
+ else if (sync_op == UPDATE_CONTROLFILE_FSYNC)
+ {
+ if (pg_fsync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fdatasync file \"%s\": %m",
+ ControlFilePath)));
+ }
pgstat_report_wait_end();
#else
+ /* In frontend code, non-zero sync_op gets you plain fsync(). */
if (fsync(fd) != 0)
pg_fatal("could not fsync file \"%s\": %m", ControlFilePath);
#endif
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 06825cad85..de6de5571e 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -12,9 +12,14 @@
#include "catalog/pg_control.h"
+#define UPDATE_CONTROLFILE_FSYNC 1
+#define UPDATE_CONTROLFILE_FDATASYNC 2
+#define UPDATE_CONTROLFILE_O_SYNC 3
+#define UPDATE_CONTROLFILE_O_DSYNC 4
+
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync);
+ ControlFileData *ControlFile, int sync_op);
extern int lock_controlfile(int fd, bool exclusive);
#ifdef WIN32
extern int unlock_controlfile(int fd);
--
2.35.1
On Fri, Feb 17, 2023 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control. Otherwise they might fail to open it
on Windows (where that patch uses a mandatory lock) or copy garbage on
Linux (as they can today, I assume), with non-zero probability -- at
least when copying files from a hot standby. Or backup tools might
want to get the file contents through some entirely different
mechanism that does the right interlocking (whatever that might be,
maybe inside the server). Perhaps this is not so much the localised
systems programming curiosity I thought it was, and has implications
that'd need to be part of the documented low-level backup steps. It
makes me like the idea a bit less. It'd be good to hear from backup
gurus what they think about that.
One cute hack I thought of to make the file lock effectively advisory
on Windows is to lock a byte range *past the end* of the file (the
documentation says you can do that). That shouldn't stop programs
that want to read the file without locking and don't know/care about
our scheme (that is, pre-existing backup tools that haven't considered
this problem and remain oblivious or accept the (low) risk of torn
reads), but will block other participants in our scheme.
If we went back to the keep-rereading-until-it-stops-changing model,
then an external backup tool would need to be prepared to do that too,
in theory at least. Maybe some already do something like that?
Or maybe the problem is/was too theoretical before...
Hi, Thomas!
On 17.02.2023 06:21, Thomas Munro wrote:
There are two kinds of atomicity that we rely on for the control file today:
* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)
I assume that documentation is talking about the first thing
I think this is true, as documentation says about write operations only,
but not about the read ones.
(BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).
Very interesting find! For instance, Intel® Optane™ Persistent Memory modules
already reach 512GB and could potentially hold cluster data. As a
first step it would be good to understand what Microsoft means by
large memory pages and what sizes they are talking about, that is, where
the reasonable boundary for using BTT lies; I suppose this would help decide
whether to use BTT or to write our own DAX-aware code.
With this patch we would stop relying on the second thing. Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking). Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.
Yes, indeed. Unless it's an atomic or transactional filesystem; in
such a case there is almost nothing to do. Another thing is that
such systems do not seem to exist in reality, although the idea has been
discussed many times. I've googled some information on this topic, e.g.
[1]: https://lwn.net/Articles/789600/
and Microsoft's own development [3].
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
That seems like a good topic for a separate thread and patch. Would you please
provide a link to the thread you mentioned?
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
Unfortunately, no ideas so far. Was it a pg_control CRC error or an I/O error?
Were the logs of such a failure saved somewhere? I would like to see
them if possible.
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
Here, could we pass a boolean flag into sendFile()? When it is true,
take the lock after OpenTransientFile() and release it before
CloseTransientFile(), when on Windows.
There are also two other places where reads or writes of pg_control
occur: WriteControlFile() in xlog.c and read_controlfile() in pg_resetwal.c.
For the second case, locking is definitely not necessary, as
the server is stopped. The first case seems fine too, since BootStrapXLOG(),
from which WriteControlFile() is called, runs only once, at cluster
initialization.
Since I've smoothly moved on to code review here, one suggestion at your
discretion: add error messages to get_controlfile() and
update_controlfile() for the case where unlock_controlfile() fails.
I think we should just make fdatasync the default on all
systems.
Agreed. And maybe make UPDATE_CONTROLFILE_FDATASYNC the default
case in UpdateControlFile(), since fdatasync is now the default on all
systems and pg_control's file metadata is of no interest?
On 22.02.2023 03:10, Thomas Munro wrote:
If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control. Otherwise they might fail to open it
on Windows (where that patch uses a mandatory lock)
As for external pg_backup_start()-based tools: if somebody wants to take the
lock while copying pg_control, I suppose that's a normal case. They may have
to wait a bit until we release the lock, as in lock_controlfile(). Moreover,
this is a very good thing to want, as it guarantees the integrity of pg_control,
provided one uses the waiting F_SETLKW rather than the non-waiting F_SETLK.
Backing up locked files is commonplace on Windows, and all standard
backup tools handle it well via VSS (Volume Shadow Copy), including the
built-in Windows backup tool. Just in case, I tested this experimentally.
During the torture test, a plain attempt to copy pg_control predictably produced:
Error: Cannot read C:\pgbins\master\data\global\pg_control!
"The process cannot access the file because another process has locked a portion of the file."
But a copy using the Windows backup utility itself succeeded via VSS.
One cute hack I thought of to make the file lock effectively advisory
on Windows is to lock a byte range *past the end* of the file (the
documentation says you can do that). That shouldn't stop programs
that want to read the file without locking and don't know/care about
our scheme (that is, pre-existing backup tools that haven't considered
this problem and remain oblivious or accept the (low) risk of torn
reads), but will block other participants in our scheme.
A very interesting idea. It makes sense when someone uses an external
backup tool that cannot work with VSS. But whether such tools are actually
used on Windows is highly doubtful, I guess: they would be unable to back up
many other applications and the Windows system itself.
Let me join your suggestion that it'd be good to hear from backup
gurus what they think about that.
or copy garbage on
Linux (as they can today, I assume), with non-zero probability -- at
least when copying files from a hot standby.
Or backup tools might
want to get the file contents through some entirely different
mechanism that does the right interlocking (whatever that might be,
maybe inside the server). Perhaps this is not so much the localised
systems programming curiosity I thought it was, and has implications
that'd need to be part of the documented low-level backup steps. It
makes me like the idea a bit less. It'd be good to hear from backup
gurus what they think about that.
If we went back to the keep-rereading-until-it-stops-changing model,
then an external backup tool would need to be prepared to do that too,
in theory at least. Maybe some already do something like that?
Or maybe the problem is/was too theoretical before...
As far as I understand, this problem has always existed, but the probability
of hitting it in practice is extremely small, as is directly pointed out in
the docs [4]:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."
Here's a patch like that.
In this patch, the problem is solved for the live database and
maybe remains for some possible cases of an external backup. On the whole,
I think it can be stated that this is a sensible step forward.
Just like last time, I ran a long stress test under Windows with the current
patch. There were no errors for more than 3 days, even with fsync=off.
[1]: https://lwn.net/Articles/789600/
[2]: https://github.com/ut-osa/txfs
[3]: https://en.wikipedia.org/wiki/Transactional_NTFS
[4]: https://www.postgresql.org/docs/devel/wal-internals.html
With the best wishes!
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
On 17.02.2023 06:21, Thomas Munro wrote:
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
It seems a good topic for a separate thread patch. Would you provide a
link to the thread you mentioned please?
/messages/by-id/367d01a7-90bb-9b70-4cda-248e81cc475c@cosium.com
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
Maybe logs of such a fail were saved somewhere? I would like to see
them if possible.
I think it was this one:
https://cirrus-ci.com/task/5004082033721344
For example, see subscription/011_generated which failed like this:
2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not
read file "global/pg_control": Permission denied
That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested. There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.
But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):
With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently). Changes:
1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail. So, in this
version I protected that sendFile() with ControlFileLock. But...
Isn't that a bit strange? To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation. That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
Or maybe the problem is/was too theoretical before...
As far as i understand, this problem has always been, but the probability of
this is extremely small in practice, which is directly pointed in
the docs [4]:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."
I guess that was talking about power loss atomicity again? Knowledge
of the read/write atomicity problem seems to be less evenly
distributed (and I think it became more likely in Linux > 3.something;
and the Windows situation possibly hadn't been examined by anyone
before).
Here's a patch like that.
In this patch, the problem is solved for the live database and
maybe remains for some possible cases of an external backup. In a whole,
i think it can be stated that this is a sensible step forward.
Just like last time, i ran a long stress test under windows with current patch.
There were no errors for more than 3 days even with fsync=off.
Thanks!
Attachments:
v3-0001-Lock-pg_control-while-reading-or-writing.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Lock-pg_control-while-reading-or-writing.patchDownload
From 62c107aa2e3c52192baf6105d6f13cd8601b7913 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v3 1/3] Lock pg_control while reading or writing.
Front-end programs that read pg_control and user-facing functions run
in the backend read pg_control without acquiring ControlFileLock. If
you're unlucky enough to read() while the backend is in write(), on at
least ext4 or NTFS, you might see partial data. Use a POSIX advisory
file lock or a Windows mandatory file lock, to avoid this problem.
Since Windows mandatory locks might cause existing external backup tools
to fail in ReadFile(), lock one byte past the end so that only tools
that opt into this scheme are affected by it.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/backup/basebackup.c | 5 ++
src/common/controldata_utils.c | 137 ++++++++++++++++++++++++++++++++
2 files changed, 142 insertions(+)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 3fb9451643..7e567300bc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -39,6 +39,7 @@
#include "storage/dsm_impl.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/reinit.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -345,8 +346,12 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
+
+ /* Lock to avoid torn reads, if there is a concurrent write. */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, &manifest, NULL);
+ LWLockRelease(ControlFileLock);
}
else
{
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..e1b7b936ba 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -39,6 +39,102 @@
#include "storage/fd.h"
#endif
+/*
+ * Lock the control file until closed. This is used for the benefit of
+ * frontend/external programs that want to take an atomic copy of the control
+ * file, even though it might be written to concurrently by the server.
+ * On some systems including Windows NTFS and Linux ext4, that might otherwise
+ * cause a torn read.
+ *
+ * (The server also reads and writes the control file directly sometimes, not
+ * through these routines; that's OK, because it uses ControlFileLock, or no
+ * locking when it expects no concurrent writes.)
+ */
+static int
+lock_controlfile(int fd, bool exclusive)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {.Offset = PG_CONTROL_FILE_SIZE};
+ HANDLE handle;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ /*
+ * On Windows, we lock the first byte past the end of the control file (see
+ * overlapped.Offset). This means that we shouldn't cause errors in
+ * external tools that just want to read the file, but we can block other
+ * processes using this routine or that know about this protocol. This
+ * provides an approximation of Unix's "advisory" locking.
+ */
+ if (!LockFileEx(handle,
+ exclusive ? LOCKFILE_EXCLUSIVE_LOCK : 0,
+ 0,
+ 1,
+ 0,
+ &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return 0;
+#else
+ struct flock lock;
+ int rc;
+
+ memset(&lock, 0, sizeof(lock));
+ lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
+ lock.l_start = 0;
+ lock.l_whence = SEEK_SET;
+ lock.l_len = 0;
+ lock.l_pid = -1;
+
+ do
+ {
+ rc = fcntl(fd, F_SETLKW, &lock);
+ }
+ while (rc < 0 && errno == EINTR);
+
+ return rc;
+#endif
+}
+
+#ifdef WIN32
+/*
+ * Release a lock acquired with lock_controlfile(). On POSIX systems, we don't
+ * bother making an extra system call to release the lock, since it'll be
+ * released on close anyway. On Windows, explicit release is recommended by
+ * the documentation to make sure it is done synchronously.
+ */
+static int
+unlock_controlfile(int fd)
+{
+ OVERLAPPED overlapped = {.Offset = PG_CONTROL_FILE_SIZE};
+ HANDLE handle;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ /* Unlock first byte. */
+ if (!UnlockFileEx(handle, 0, 1, 0, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return 0;
+}
+#endif
+
/*
* get_controlfile()
*
@@ -74,6 +170,20 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
ControlFilePath);
#endif
+ /* Make sure we can read the file atomically. */
+ if (lock_controlfile(fd, false) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for reading: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for reading: %m",
+ ControlFilePath);
+#endif
+ }
+
r = read(fd, ControlFile, sizeof(ControlFileData));
if (r != sizeof(ControlFileData))
{
@@ -97,6 +207,10 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
#endif
}
+#ifdef WIN32
+ unlock_controlfile(fd);
+#endif
+
#ifndef FRONTEND
if (CloseTransientFile(fd) != 0)
ereport(ERROR,
@@ -186,6 +300,25 @@ update_controlfile(const char *DataDir,
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
+ /*
+ * Make sure that any concurrent reader (including frontend programs) can
+ * read the file atomically. Note that this refers to atomicity of
+ * concurrent reads and writes. For our assumption of atomicity under
+ * power failure, see PG_CONTROL_MAX_SAFE_SIZE.
+ */
+ if (lock_controlfile(fd, true) < 0)
+ {
+#ifndef FRONTEND
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not lock file \"%s\" for writing: %m",
+ ControlFilePath)));
+#else
+ pg_fatal("could not lock file \"%s\" for writing: %m",
+ ControlFilePath);
+#endif
+ }
+
errno = 0;
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
@@ -225,6 +358,10 @@ update_controlfile(const char *DataDir,
#endif
}
+#ifdef WIN32
+ unlock_controlfile(fd);
+#endif
+
if (close(fd) != 0)
{
#ifndef FRONTEND
--
2.39.2
v3-0002-Make-wal_sync_method-fdatasync-the-default-on-all.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Make-wal_sync_method-fdatasync-the-default-on-all.patchDownload
From 17a942e760f2abec6490f416609f986b7180763c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 16 Feb 2023 20:02:22 +1300
Subject: [PATCH v3 2/3] Make wal_sync_method=fdatasync the default on all
OSes.
Previously, fdatasync was the default level on Linux and FreeBSD by
special rules, and on OpenBSD and DragonflyBSD because they didn't have
O_DSYNC != O_SYNC or O_DSYNC at all. For every other system, we'd
choose open_datasync.
Use fdatasync everywhere, for consistency. This became possible after
commit 9430fb40 added support for Windows, the last known system not to
have it. This means that we'll now flush caches on consumer drives by
default on Windows, where previously we didn't, which seems like a
better default for crash safety. Users who want the older behavior can
still request it with wal_sync_method=open_datasync.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
doc/src/sgml/config.sgml | 5 ++---
doc/src/sgml/wal.sgml | 8 ++++----
src/bin/pg_test_fsync/pg_test_fsync.c | 4 ++--
src/include/access/xlogdefs.h | 14 +-------------
src/include/port/freebsd.h | 7 -------
src/include/port/linux.h | 8 --------
6 files changed, 9 insertions(+), 37 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..15d5f2bf2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3107,9 +3107,8 @@ include_dir 'conf.d'
<para>
The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.
Not all of these choices are available on all platforms.
- The default is the first method in the above list that is supported
- by the platform, except that <literal>fdatasync</literal> is the default on
- Linux and FreeBSD. The default is not necessarily ideal; it might be
+ The default is <literal>fdatasync</literal>.
+ The default is not necessarily ideal; it might be
necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or
achieve optimal performance.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ed7929cbcd..ffc22a7ded 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -106,12 +106,12 @@
<listitem>
<para>
On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
- <literal>open_datasync</literal> (the default), write caching can be disabled
+ <literal>open_datasync</literal>, write caching can be disabled
by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
Alternatively, set <varname>wal_sync_method</varname> to
- <literal>fdatasync</literal> (NTFS only), <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, which prevent
- write caching.
+ <literal>fdatasync</literal> (the default), <literal>fsync</literal> or
+ <literal>fsync_writethrough</literal>, which use system calls that
+ flush write caches.
</para>
</listitem>
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 3d5e8f30ab..45bd2daf2e 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -292,7 +292,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
+ printf(_("(fdatasync is the default)\n"));
/*
* Test open_datasync if available
@@ -326,7 +326,7 @@ test_sync(int writes_per_op)
#endif
/*
- * Test fdatasync if available
+ * Test fdatasync
*/
printf(LABEL_FORMAT, "fdatasync");
fflush(stdout);
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index fe794c7740..a628976902 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -64,19 +64,7 @@ typedef uint32 TimeLineID;
*/
typedef uint16 RepOriginId;
-/*
- * This chunk of hackery attempts to determine which file sync methods
- * are available on the current platform, and to choose an appropriate
- * default method.
- *
- * Note that we define our own O_DSYNC on Windows, but not O_SYNC.
- */
-#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
-#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
-#elif defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
-#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
-#else
+/* Default synchronization method for WAL. */
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
-#endif
#endif /* XLOG_DEFS_H */
diff --git a/src/include/port/freebsd.h b/src/include/port/freebsd.h
index 0e3fde55d6..2e36d3da4f 100644
--- a/src/include/port/freebsd.h
+++ b/src/include/port/freebsd.h
@@ -1,8 +1 @@
/* src/include/port/freebsd.h */
-
-/*
- * Set the default wal_sync_method to fdatasync. xlogdefs.h's normal rules
- * would prefer open_datasync on FreeBSD 13+, but that is not a good choice on
- * many systems.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index 7a6e46cdbb..acd867606c 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,11 +12,3 @@
* to have a kernel version test here.
*/
#define HAVE_LINUX_EIDRM_BUG
-
-/*
- * Set the default wal_sync_method to fdatasync. With recent Linux versions,
- * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
- * perform better and (b) causes outright failures on ext4 data=journal
- * filesystems, because those don't support O_DIRECT.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
--
2.39.2
v3-0003-Apply-wal_sync_method-to-pg_control-file.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Apply-wal_sync_method-to-pg_control-file.patchDownload
From e28b7bae80a711a8c23fc2332e653097a931e2bb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 21:08:12 +1300
Subject: [PATCH v3 3/3] Apply wal_sync_method to pg_control file.
When writing out the control file frequently on a standby, we can go a
little faster on some systems by using fdatasync() instead of fsync(),
with no loss of safety. We already respected the special
"wal_sync_method=fsync_writethrough" via a circuitous route, but we
might as well support the full range of methods.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/access/transam/xlog.c | 19 ++++++++-
src/common/controldata_utils.c | 59 ++++++++++++++++++++------
src/include/common/controldata_utils.h | 7 ++-
3 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87af608d15..4ad4899256 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4167,7 +4167,24 @@ ReadControlFile(void)
static void
UpdateControlFile(void)
{
- update_controlfile(DataDir, ControlFile, true);
+ int sync_op;
+
+ switch (sync_method)
+ {
+ case SYNC_METHOD_FDATASYNC:
+ sync_op = UPDATE_CONTROLFILE_FDATASYNC;
+ break;
+ case SYNC_METHOD_OPEN:
+ sync_op = UPDATE_CONTROLFILE_O_SYNC;
+ break;
+ case SYNC_METHOD_OPEN_DSYNC:
+ sync_op = UPDATE_CONTROLFILE_O_DSYNC;
+ break;
+ default:
+ sync_op = UPDATE_CONTROLFILE_FSYNC;
+ }
+
+ update_controlfile(DataDir, ControlFile, sync_op);
}
/*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index e1b7b936ba..0c81f66dc4 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -250,18 +250,39 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
* update_controlfile()
*
* Update controlfile values with the contents given by caller. The
- * contents to write are included in "ControlFile". "do_sync" can be
- * optionally used to flush the updated control file. Note that it is up
- * to the caller to properly lock ControlFileLock when calling this
- * routine in the backend.
+ * contents to write are included in "ControlFile". "sync_op" can be set
+ * to 0 or a synchronization method UPDATE_CONTROLFILE_XXX. In frontend code,
+ * only 0 and non-zero (fsync) are meaningful.
+ * Note that it is up to the caller to properly lock ControlFileLock when
+ * calling this routine in the backend.
*/
void
update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync)
+ ControlFileData *ControlFile,
+ int sync_op)
{
int fd;
char buffer[PG_CONTROL_FILE_SIZE];
char ControlFilePath[MAXPGPATH];
+ int open_flag;
+
+ switch (sync_op)
+ {
+#ifndef FRONTEND
+#ifdef O_SYNC
+ case UPDATE_CONTROLFILE_O_SYNC:
+ open_flag = O_SYNC;
+ break;
+#endif
+#ifdef O_DSYNC
+ case UPDATE_CONTROLFILE_O_DSYNC:
+ open_flag = O_DSYNC;
+ break;
+#endif
+#endif
+ default:
+ open_flag = 0;
+ }
/* Update timestamp */
ControlFile->time = (pg_time_t) time(NULL);
@@ -289,13 +310,13 @@ update_controlfile(const char *DataDir,
* All errors issue a PANIC, so no need to use OpenTransientFile() and to
* worry about file descriptor leaks.
*/
- if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+ if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | open_flag)) < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
ControlFilePath)));
#else
- if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY,
+ if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY | open_flag,
pg_file_create_mode)) == -1)
pg_fatal("could not open file \"%s\": %m", ControlFilePath);
#endif
@@ -342,17 +363,29 @@ update_controlfile(const char *DataDir,
pgstat_report_wait_end();
#endif
- if (do_sync)
+ if (sync_op != 0)
{
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
- if (pg_fsync(fd) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- ControlFilePath)));
+ if (sync_op == UPDATE_CONTROLFILE_FDATASYNC)
+ {
+ if (fdatasync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fdatasync file \"%s\": %m",
+ ControlFilePath)));
+ }
+ else if (sync_op == UPDATE_CONTROLFILE_FSYNC)
+ {
+ if (pg_fsync(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ ControlFilePath)));
+ }
pgstat_report_wait_end();
#else
+ /* In frontend code, non-zero sync_op gets you plain fsync(). */
if (fsync(fd) != 0)
pg_fatal("could not fsync file \"%s\": %m", ControlFilePath);
#endif
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..879455c8fb 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -12,8 +12,13 @@
#include "catalog/pg_control.h"
+#define UPDATE_CONTROLFILE_FSYNC 1
+#define UPDATE_CONTROLFILE_FDATASYNC 2
+#define UPDATE_CONTROLFILE_O_SYNC 3
+#define UPDATE_CONTROLFILE_O_DSYNC 4
+
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync);
+ ControlFileData *ControlFile, int sync_op);
#endif /* COMMON_CONTROLDATA_UTILS_H */
--
2.39.2
Hi, Thomas!
On 04.03.2023 00:39, Thomas Munro wrote:
It seems a good topic for a separate patch thread. Would you provide a
link to the thread you mentioned please?
/messages/by-id/367d01a7-90bb-9b70-4cda-248e81cc475c@cosium.com
Thanks! The important words there:
But I fail to see how full_page_writes prevents this since it only act on writes
It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.
Together with the docs about restoring a corrupted pg_control in the
pg_resetwal utility description, these words made me wonder whether we
should save the contents of pg_control into WAL at the beginning and end
of each checkpoint, and teach pg_resetwal to read them. It would be like
a periodic backup of pg_control to a safe place.
This thought has nothing to do with this patch and this thread, and, in general,
I'm not sure whether it has any practical value, or whether, on the contrary, it
might lead to some difficulties. If it seems worthwhile, we can think
about it further.
For example, see subscription/011_generated which failed like this:
2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not
read file "global/pg_control": Permission denied
That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested. There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.
But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):
It seems to be a race between the first reading of pg_control in PostmasterMain()
in LocalProcessControlFile(false) and the second one in SubPostmasterMain() here:
/*
* (re-)read control file, as it contains config. The postmaster will
* already have read this, but this process doesn't know about that.
*/
LocalProcessControlFile(false);
which crashes according to the crash log: crashlog-postgres.exe_19a0_2023-02-16_06-57-26-675.txt
With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently). Changes:
1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
Although the changes in 1. contribute to the problem described above again,
4. fixes this. And I did not find any other places where ReadControlFile()
can be called in different processes.
That's all ok.
Thanks to 4., it is no longer necessary to use VSS to copy the pg_control file;
it can be copied in the ordinary way even during the torture test. This is very good.
I really like the idea of using an LWLock where possible.
In general, I think these changes make the patch more lightweight and fast.
Also, I ran tests for more than a day in stress mode with fsync=off under Windows
and Linux and found no problems. The patch-tester also passes without errors.
I would like to move this patch to RFC, since I don't see any problems
either in the code or in the tests, but the pg_backup_start() subproblem confuses me.
Maybe move it to a separate patch in a distinct thread?
As there are a number of suggestions and questions to discuss such as:
Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail. So, in this
version I protected that sendFile() with ControlFileLock. But...
Isn't that a bit strange? To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation. That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
The variant with a separate utility looks good, with a recommendation
in the docs to use it for copying pg_control.
It also seems possible to make a function available from SQL, such as
export_pg_control('dst_path'), with the destination path as an argument,
to be called before pg_backup_stop().
Or pass the pg_control destination path to pg_backup_stop() as an extra argument.
Or save pg_control to a predetermined location during pg_backup_stop() and specify
in the docs that one needs to copy it from there. I suppose that we have the right
to ask the user to perform some manual manipulations here like with backup_label.
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
Sorry, I didn't understand the question about the log. Would you please
explain what kind of log you meant, and where I can see this safe copy
being created in the code?
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
On 04.03.2023 00:39, Thomas Munro wrote:
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
Sorry, I didn't understand the question about the log. Would you please
explain what kind of log you meant, and where I can see this safe copy
being created in the code?
Sorry, I was confused; please ignore that part. We don't have a copy
of the control file anywhere else. (Perhaps we should, but that could
be a separate topic.)
On 08.03.2023 07:28, Thomas Munro wrote:
Sorry, I was confused; please ignore that part. We don't have a copy
of the control file anywhere else. (Perhaps we should, but that could
be a separate topic.)
That’s all right! Fully agreed that this is a possible separate topic.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
This patch no longer applies and needs a rebase.
Given where we are in the commitfest, do you think this patch has the potential
to go in or should it be moved?
--
Daniel Gustafsson
This is a frustrating thread, because despite the last patch solving
most of the problems we discussed, it doesn't address the
low-level-backup procedure in a nice way. We'd have to tell users
they have to flock that file, or add a new step "pg_controldata --raw
pg_control", which seems weird when they already have a connection
to the server.
Maybe it just doesn't matter if eg the pg_controldata program can
spuriously fail if pointed at a running system, and I was being too
dogmatic trying to fix even that. Maybe we should just focus on
fixing backups. Even there, I am beginning to suspect we are solving
this problem in the wrong place when a higher level change could
simplify the problem away.
Idea for future research: Perhaps pg_backup_stop()'s label-file
output should include the control file image (suitably encoded)? Then
the recovery-from-label code could completely ignore the existing
control file, and overwrite it using that copy. It's already
partially ignoring it, by using the label file's checkpoint LSN
instead of the control file's. Perhaps the captured copy could
include the correct LSN already, simplifying that code, and the low
level backup procedure would not need any additional steps or caveats.
No more atomicity problem for low-level-backups... but probably not
something we would back-patch, for such a rare failure mode.
Here's a new minimal patch that solves only the bugs in basebackup +
the simple SQL-facing functions that read the control file, by simply
acquiring ControlFileLock in the obvious places. This should be
simple enough for back-patching?
Perhaps we could use the control file image from server memory, but
that requires us to be certain that its CRC is always up to date.
That seems to be true, but I didn't want to require it for this, and
it doesn't seem important for non-performance-critical code.
Thoughts?
As for the other topics that came up in this thread, I kicked the
wal_sync_method thing out to its own thread[1]. (There was a logical
chain connecting these topics: "can I add file lock system calls
here?" -> "well if someone is going to complain that it's performance
critical then why are we using unnecessarily slow pg_fsync()?" ->
"well if we change that to pg_fdatasync() we have to address known
weakness/kludge on macOS first". I don't like the flock stuff
anymore, but I do want to fix the known macOS problem independently.
Hereby disentangled.)
[1]: /messages/by-id/CA+hUKG+F0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg@mail.gmail.com
Attachments:
v4-0001-Acquire-ControlFileLock-in-base-backups.patch
From ffc679fa4f4261cf9050571bb486575b2e4995f3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v4 1/2] Acquire ControlFileLock in base backups.
The control file could previously be overwritten while a base backup was
reading it. On some file systems with weak interlocking, including ext4
and ntfs, the concurrent read could see a mixture of old and new
contents. The resulting copy would be corrupted, and the new cluster
would not be able to start up. Other files are read without
interlocking, but those will be corrected by replaying the log.
Exclude this possibility by acquiring ControlFileLock. Copy into a
temporary buffer first because we don't want to hold the lock while in
sendFile() code that might sleep if throttling is configured. This
requires a minor adjustment to sendFileWithContent() to support binary
data.
This does not address the same possibility when using external
file system copy tools (as described in the documentation under "Making
a Base Backup Using the Low Level API").
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru> (earlier version)
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/backup/basebackup.c | 36 ++++++++++++++++++---------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..f03496665f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
#include "backup/basebackup_target.h"
#include "commands/defrem.h"
#include "common/compression.h"
+#include "common/controldata_utils.h"
#include "common/file_perm.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
@@ -39,6 +40,7 @@
#include "storage/dsm_impl.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/reinit.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
static void sendFileWithContent(bbsink *sink, const char *filename,
- const char *content,
+ const char *content, int len,
backup_manifest_info *manifest);
static int64 _tarWriteHeader(bbsink *sink, const char *filename,
const char *linktarget, struct stat *statbuf,
@@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (ti->path == NULL)
{
- struct stat statbuf;
+ ControlFileData *control_file;
+ bool crc_ok;
bool sendtblspclinks = true;
char *backup_label;
@@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* In the main tar, include the backup_label first... */
backup_label = build_backup_content(backup_state, false);
sendFileWithContent(sink, BACKUP_LABEL_FILE,
- backup_label, &manifest);
+ backup_label, -1, &manifest);
pfree(backup_label);
/* Then the tablespace_map file, if required... */
if (opt->sendtblspcmapfile)
{
sendFileWithContent(sink, TABLESPACE_MAP,
- tablespace_map->data, &manifest);
+ tablespace_map->data, -1, &manifest);
sendtblspclinks = false;
}
@@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendtblspclinks, &manifest, NULL);
/* ... and pg_control after everything else. */
- if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not stat file \"%s\": %m",
- XLOG_CONTROL_FILE)));
- sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ control_file = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+ (char *) control_file, sizeof(*control_file),
+ &manifest);
+ pfree(control_file);
}
else
{
@@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
* complete segment.
*/
StatusFilePath(pathbuf, walFileName, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/*
@@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/* Properly terminate the tar file. */
@@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
static void
sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+ int len,
backup_manifest_info *manifest)
{
struct stat statbuf;
- int bytes_done = 0,
- len;
+ int bytes_done = 0;
pg_checksum_context checksum_ctx;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
filename);
- len = strlen(content);
+ if (len < 0)
+ len = strlen(content);
/*
* Construct a stat struct for the backup_label file we're injecting in
--
2.41.0
v4-0002-Acquire-ControlFileLock-in-SQL-functions.patch
From 7dd75dc8098f2981c929245ec75ff6e2931f3b71 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 11:50:27 +1200
Subject: [PATCH v4 2/2] Acquire ControlFileLock in SQL functions.
Commit dc7d70ea added functions that read the control file, but didn't
acquire ControlFileLock. With unlucky timing and on certain file
systems, they could read corrupted data that is in the process of being
overwritten, and the checksum would fail.
Back-patch to all supported releases.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/utils/misc/pg_controldata.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index f2c1084797..a1003a464d 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -24,6 +24,7 @@
#include "common/controldata_utils.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/builtins.h"
#include "utils/pg_lsn.h"
#include "utils/timestamp.h"
@@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* Read the control file. */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
--
2.41.0
Greetings,
(Adding David Steele into the CC on this one...)
* Thomas Munro (thomas.munro@gmail.com) wrote:
This is a frustrating thread, because despite the last patch solving
most of the problems we discussed, it doesn't address the
low-level-backup procedure in a nice way. We'd have to tell users
they have to flock that file, or add a new step "pg_controldata --raw
pg_control", which seems weird when they already have a connection
to the server.
flock'ing the file strikes me as dangerous to ask external tools to do
due to the chances of breaking the running system if they don't do it
right. I'm generally open to the idea of having the backup tool have to
do more complicated work to be correct but that just seems likely to
cause issues. Also- haven't looked yet, but I'm not sure that's even
possible if your backup tool is running as a user who only has read
access to the data directory? I don't want us to give up on that
feature.
Maybe it just doesn't matter if eg the pg_controldata program can
spuriously fail if pointed at a running system, and I was being too
dogmatic trying to fix even that. Maybe we should just focus on
fixing backups. Even there, I am beginning to suspect we are solving
this problem in the wrong place when a higher level change could
simplify the problem away.
For a running system.. perhaps pg_controldata should be connecting to
the database and calling functions there? Or just complain that the
system is online and tell the user to do that?
Idea for future research: Perhaps pg_backup_stop()'s label-file
output should include the control file image (suitably encoded)? Then
the recovery-from-label code could completely ignore the existing
control file, and overwrite it using that copy. It's already
partially ignoring it, by using the label file's checkpoint LSN
instead of the control file's. Perhaps the captured copy could
include the correct LSN already, simplifying that code, and the low
level backup procedure would not need any additional steps or caveats.
No more atomicity problem for low-level-backups... but probably not
something we would back-patch, for such a rare failure mode.
I like this general direction and wonder if we could possibly even push
a bit harder on it: have the backup_label include the control file's
contents in some form that is understood and then tell tools to *not*
copy the running pg_control file ... and maybe even complain if a
pg_control file exists when we detect that backup_label has the control
file's image. We've certainly had problems in the past where people
would just nuke the backup_label file, even though they were restoring
from a backup, because they couldn't start the system up since their
restore command wasn't set up properly or their WAL archive was missing.
Being able to get rid of the control file being in the backup at all
would make it harder for someone to get to a corrupted-but-running
system and that seems like it's a good thing.
Here's a new minimal patch that solves only the bugs in basebackup +
the simple SQL-facing functions that read the control file, by simply
acquiring ControlFileLock in the obvious places. This should be
simple enough for back-patching?
I don't particularly like fixing this in a way that only works for
pg_basebackup and means that the users of other backup tools don't have
a solution to this problem. What are we supposed to tell users of
pgBackRest when they see this fix for pg_basebackup in the next set of
minor releases and they ask us if we've addressed this risk?
We might be able to accept the 're-read on CRC failure' approach, if it
were being used for pg_controldata and we documented that external
tools and especially backup tools using the low-level API are required
to check the CRC and to re-read on failure if accessing a running
system.
While it's not ideal, maybe we could get away with changing the contents
of the backup_label as part of a back-patch? The documentation, at
least on a quick look, says to copy the second field from pg_backup_stop
into a backup_label file but doesn't say anything about what those
contents are or if they can change. That would at least address the
concern of backups ending up with a corrupt pg_control file and not
being able to be restored, even if tools aren't updated to verify the
CRC or similar. Of course, that's a fair bit of code churn for a
backpatch, which I certainly understand as being risky. If we can't
back-patch that, it might still be the way to go moving forward, while
also telling tools to check the CRC. (I'm not going to try to figure
out some back-patchable pretend solution for this for shell scripts that
pretend to be able to backup running PG databases; this issue is
probably the least of their issues anyway...)
A couple of interesting notes on this though- pgBackRest doesn't only
read the pg_control file at backup time, we also check it at
archive_command time, to make sure that the system ID and version that
are in the control file match up with the information in the WAL file
we're getting ready to archive and that those match with the system ID
and version of the repo/stanza into which we are pushing the WAL file.
We do read the control file on the replica but that isn't the one we
actually push into the repo as part of a backup- that's always the one
we read from the primary (we don't currently support 'backup just from
the replica').
Coming out of our discussion regarding this, we're likely to move
forward on the check-CRC-and-re-read approach for the next pgBackRest
release. If PG provides a better solution for us to use, great, but
given that this has been shown to happen, we're not intending to wait
around for PG to provide us with a better fix.
Thanks,
Stephen
On Fri, Jul 21, 2023 at 8:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Idea for future research: Perhaps pg_backup_stop()'s label-file
output should include the control file image (suitably encoded)? Then
the recovery-from-label code could completely ignore the existing
control file, and overwrite it using that copy. It's already
partially ignoring it, by using the label file's checkpoint LSN
instead of the control file's. Perhaps the captured copy could
include the correct LSN already, simplifying that code, and the low
level backup procedure would not need any additional steps or caveats.
No more atomicity problem for low-level-backups... but probably not
something we would back-patch, for such a rare failure mode.
I don't really know what the solution is, but this is a general
problem with the low-level backup API, and I think it sucks pretty
hard. Here, we're talking about the control file, but the same problem
exists with the data files. We try to work around that but it's all
hacks. Unless your backup tool has special magic powers of some kind,
you can't take a backup using either pg_basebackup or the low-level
API and then check that individual blocks have valid checksums, or
that they have sensible, interpretable contents, because they might
not. (Yeah, I know we have code to verify checksums during a base
backup, but as discussed elsewhere, it doesn't work.) It's also why we
have to force full-page write on during a backup. But the whole thing
is nasty because you can't really verify anything about the backup you
just took. It may be full of gibberish blocks but don't worry because,
if all goes well, recovery will fix it. But you won't really know
whether recovery actually does fix it. You just kind of have to cross
your fingers and hope.
It's unclear to me how we could do better, especially when using the
low-level API. BASE_BACKUP could read via shared_buffers instead of
the FS, and I think that might be a good idea if we can defend
adequately against cache poisoning, but with the low-level API someone
may just be calling a FS-level snapshot primitive. Unless we're
prepared to pause all writes while that happens, I don't know how to
do better.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
* Thomas Munro (thomas.munro@gmail.com) wrote:
Here's a new minimal patch that solves only the bugs in basebackup +
the simple SQL-facing functions that read the control file, by simply
acquiring ControlFileLock in the obvious places. This should be
simple enough for back-patching?
I don't particularly like fixing this in a way that only works for
pg_basebackup and means that the users of other backup tools don't have
a solution to this problem. What are we supposed to tell users of
pgBackRest when they see this fix for pg_basebackup in the next set of
minor releases and they ask us if we've addressed this risk?
We might be able to accept the 're-read on CRC failure' approach, if it
were being used for pg_controldata and we documented that external
tools and especially backup tools using the low-level API are required
to check the CRC and to re-read on failure if accessing a running
system.
Hi Stephen, David, and thanks for looking. Alright, let's try that idea out.
0001 + 0002. Acquire the lock in the obvious places in the backend,
to completely exclude the chance of anything going wrong for the easy
cases, including pg_basebackup. (This is the v4 from yesterday).
And...
0003. Document this problem and how to detect it, in the
low-level-backup section. Better words welcome. And...
0004. Retries for front-end programs, using the idea suggested by Tom
(to wit: if it fails, retry until it fails with the same CRC value
twice). It's theoretically imperfect, but it's highly unlikely to
fail in practice and the best we can do without file locks or a
connection to the server AFAICT. (This is the first patch I posted,
adjusted to give up after 10 (?) loops, and not to bother at all in
backend code since that takes ControlFileLock with the 0001 patch).
While it's not ideal, maybe we could get away with changing the contents
of the backup_label as part of a back-patch? The documentation, at
least on a quick look, says to copy the second field from pg_backup_stop
into a backup_label file but doesn't say anything about what those
contents are or if they can change. That would at least address the
concern of backups ending up with a corrupt pg_control file and not
being able to be restored, even if tools aren't updated to verify the
CRC or similar. Of course, that's a fair bit of code churn for a
backpatch, which I certainly understand as being risky. If we can't
back-patch that, it might still be the way to go moving forward, while
also telling tools to check the CRC. (I'm not going to try to figure
out some back-patchable pretend solution for this for shell scripts that
pretend to be able to backup running PG databases; this issue is
probably the least of their issues anyway...)
I think you're probably right that anyone following those instructions
would be OK if we just back-patched such a thing, but it all seems a
little too much to me. +1 to looking into that for v17 (and I guess
maybe someone could eventually argue for back-patching much later with
experience). I'm sure other solutions are possible too... other
places to put a safe atomic copy of the control file could include: in
the WAL, or in extra files (pg_control.XXX) in the data directory +
some infallible way for recovery to choose which one to start up from.
Or something.
Attachments:
v5-0001-Acquire-ControlFileLock-in-base-backups.patch
From 6322d7159ae418582392fa6c0e11863a016378f4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v5 1/4] Acquire ControlFileLock in base backups.
The control file could previously be overwritten while a base backup was
reading it. On some file systems (ext4, ntfs), the concurrent read
could see a mixture of old and new contents. The resulting copy would
be corrupted, and the new cluster would not be able to start up. Other
files are read without interlocking, but those will be corrected by
replaying the log.
Exclude this possibility by acquiring ControlFileLock. Copy into a
temporary buffer first because we don't want to hold the lock while in
sendFile() code that might sleep if throttling is configured. This
requires a minor adjustment to sendFileWithContent() to support binary
data.
This does not address the same possibility when using external
file system copy tools (as described in the documentation under "Making
a Base Backup Using the Low Level API").
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru> (earlier version)
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/backup/basebackup.c | 36 ++++++++++++++++++---------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..f03496665f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
#include "backup/basebackup_target.h"
#include "commands/defrem.h"
#include "common/compression.h"
+#include "common/controldata_utils.h"
#include "common/file_perm.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
@@ -39,6 +40,7 @@
#include "storage/dsm_impl.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/reinit.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
static void sendFileWithContent(bbsink *sink, const char *filename,
- const char *content,
+ const char *content, int len,
backup_manifest_info *manifest);
static int64 _tarWriteHeader(bbsink *sink, const char *filename,
const char *linktarget, struct stat *statbuf,
@@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (ti->path == NULL)
{
- struct stat statbuf;
+ ControlFileData *control_file;
+ bool crc_ok;
bool sendtblspclinks = true;
char *backup_label;
@@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* In the main tar, include the backup_label first... */
backup_label = build_backup_content(backup_state, false);
sendFileWithContent(sink, BACKUP_LABEL_FILE,
- backup_label, &manifest);
+ backup_label, -1, &manifest);
pfree(backup_label);
/* Then the tablespace_map file, if required... */
if (opt->sendtblspcmapfile)
{
sendFileWithContent(sink, TABLESPACE_MAP,
- tablespace_map->data, &manifest);
+ tablespace_map->data, -1, &manifest);
sendtblspclinks = false;
}
@@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendtblspclinks, &manifest, NULL);
/* ... and pg_control after everything else. */
- if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not stat file \"%s\": %m",
- XLOG_CONTROL_FILE)));
- sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ control_file = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+ (char *) control_file, sizeof(*control_file),
+ &manifest);
+ pfree(control_file);
}
else
{
@@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
* complete segment.
*/
StatusFilePath(pathbuf, walFileName, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/*
@@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/* Properly terminate the tar file. */
@@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
static void
sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+ int len,
backup_manifest_info *manifest)
{
struct stat statbuf;
- int bytes_done = 0,
- len;
+ int bytes_done = 0;
pg_checksum_context checksum_ctx;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
filename);
- len = strlen(content);
+ if (len < 0)
+ len = strlen(content);
/*
* Construct a stat struct for the backup_label file we're injecting in
--
2.39.2
v5-0002-Acquire-ControlFileLock-in-SQL-functions.patch (text/x-patch)
From e55b05d52b792c2de35cb519e3a6f0de6030b8f9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 11:50:27 +1200
Subject: [PATCH v5 2/4] Acquire ControlFileLock in SQL functions.
Commit dc7d70ea added functions that read the control file, but didn't
acquire ControlFileLock. With unlucky timing and on certain file
systems, they could read corrupted data that is in the process of being
overwritten, and the checksum would fail.
Back-patch to all supported releases.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/utils/misc/pg_controldata.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index f2c1084797..a1003a464d 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -24,6 +24,7 @@
#include "common/controldata_utils.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/builtins.h"
#include "utils/pg_lsn.h"
#include "utils/timestamp.h"
@@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* Read the control file. */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
--
2.39.2
v5-0003-Doc-Warn-about-pg_control-corruption-in-low-level.patch (text/x-patch)
From 3b28603553bf483c3cda280eeae6790693b930b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 25 Jul 2023 11:11:17 +1200
Subject: [PATCH v5 3/4] Doc: Warn about pg_control corruption in low-level
backups.
On some file systems (ext4, ntfs), copying the control file while it's
being modified can result in a corrupted file with a bad checksum,
and then recovery fails. Document a way to test for this problem as
part of the low-level-backup procedure.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
doc/src/sgml/backup.sgml | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..60e110ea0a 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -920,6 +920,23 @@ SELECT pg_backup_start(label => 'label', fast => false);
consider during this backup.
</para>
</listitem>
+ <listitem>
+ <para>
+ One file copied by the previous step requires special consideration.
+ For all other files, the effects of concurrent modification will be
+ corrected by WAL replay, but <literal>global/pg_control</literal> must
+ be self-consistent before replay begins. On some systems, it might be
+ corrupted by a concurrent modification during the backup. It can be
+ verified by installing it (after extracting from archive file if
+ applicable) into an otherwise empty directory (e.g.
+ <literal>test_pgdata/global/pg_control</literal>) and then
+ passing the path of the top directory to the
+ <application>pg_controldata</application> program.
+ <application>pg_controldata</application> will verify
+ the internal checksum, and fail if corruption is detected. In that
+ unlikely case, the file should be backed up again and re-tested.
+ </para>
+ </listitem>
<listitem>
<para>
In the same connection as before, issue the command:
--
2.39.2
v5-0004-Try-to-tolerate-torn-reads-of-control-file-in-fro.patch (text/x-patch)
From 7c40600581799b12eeb8550aa095385e7adfb5a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH v5 4/4] Try to tolerate torn reads of control file in
frontend.
Some of our src/bin tools read the control file without any kind of
interlocking against concurrent writes. In the backend we avoid this
problem with ControlFileLock, but we can't do that from a stand-alone
program.
Tolerate the torn read that can occur on some systems (ext4, ntfs) by
retrying if checksum fails, until we get two reads in a row with the
same checksum. This is only a last-ditch effort and not guaranteed to
reach the right conclusion with extremely unlucky scheduling, but it
seems at least very likely to. Thanks to Tom Lane for this suggestion.
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/common/controldata_utils.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..8b1786512f 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,22 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
char ControlFilePath[MAXPGPATH];
pg_crc32c crc;
int r;
+#ifdef FRONTEND
+ pg_crc32c last_crc;
+ int retries = 0;
+#endif
Assert(crc_ok_p);
ControlFile = palloc_object(ControlFileData);
snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
+#ifdef FRONTEND
+ INIT_CRC32C(last_crc);
+
+retry:
+#endif
+
#ifndef FRONTEND
if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
ereport(ERROR,
@@ -117,6 +127,26 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
+#ifdef FRONTEND
+
+ /*
+ * With unlucky timing on filesystems that don't implement atomicity of
+ * concurrent reads and writes, we might have seen garbage if the server
+ * was writing to the file at the same time. Keep retrying until we see
+ * the same CRC twice, with a tiny sleep to give a concurrent writer a
+ * good chance of making progress.
+ */
+ if (!*crc_ok_p &&
+ (retries == 0 || !EQ_CRC32C(crc, last_crc)) &&
+ retries < 10)
+ {
+ retries++;
+ last_crc = crc;
+ pg_usleep(10000);
+ goto retry;
+ }
+#endif
+
/* Make sure the control file is valid byte order. */
if (ControlFile->pg_control_version % 65536 == 0 &&
ControlFile->pg_control_version / 65536 != 0)
--
2.39.2
On Tue, Jul 25, 2023 at 8:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
(Yeah, I know we have code to verify checksums during a base
backup, but as discussed elsewhere, it doesn't work.)
BTW the code you are referring to there seems to think 4KB
page-halves are atomic; not sure if that's imagining page-level
locking in ancient Linux (?), or imagining default setvbuf() buffer
size observed with some specific implementation of fread(), or
confusing power-failure-sector-based atomicity with concurrent access
atomicity, or something else, but for the record what we actually see
in this scenario on ext4 is the old/new page contents mashed together
on much smaller boundaries (maybe cache lines), caused by duelling
concurrent memcpy() to/from, independent of any buffer/page-level
implementation details we might have been thinking of with that code.
Makes me wonder if it's even technically sound to examine the LSN.
It's also why we
have to force full-page write on during a backup. But the whole thing
is nasty because you can't really verify anything about the backup you
just took. It may be full of gibberish blocks but don't worry because,
if all goes well, recovery will fix it. But you won't really know
whether recovery actually does fix it. You just kind of have to cross
your fingers and hope.
Well, not without also scanning the WAL for FPIs, anyway... And
conceptually, that's why I think we probably want an 'FPI' of the
control file somewhere.
While chatting to Robert and Andres about all this, a new idea came
up. Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it. Unfortunately, it also suffers from Windows-specific
problems that I originally mentioned at the top of this thread but
had since repressed. Arrrghgh.
First, the good news:
We could write out a whole new control file, and durable_rename() it
into place. We don’t want to do that in general, because we don’t
want to slow down UpdateMinRecoveryPoint(). The new concept is to do
that only if a backup is in progress. That requires a bit of
interlocking with backup start/stop (ie when runningBackups is
changing in shmem, we don’t want to overlap with UpdateControlFile()'s
decision on how to do it). Here is a patch to try that out. No more
weasel wording needed for the docs; basebackup and low-level file
system backup should always see an atomic control file (and
occasionally also copy a harmless pg_control.tmp file). Then we only
need the gross retry-until-stable hack for front-end programs.
And the bad news:
In my catalogue-of-Windows-weirdness project[1], I learned in v3-0003 that:
+ fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+ PG_REQUIRE_SYS(fd < 0 || close(fd) == 0);
+
+ fd = open(path2, O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "open name2.txt");
+ make_path(path2, "name2.txt");
+#ifdef WIN32
+
+ /*
+ * Windows can't rename over an open non-unlinked file, even with
+ * have_posix_unlink_semantics.
+ */
+ pgwin32_dirmod_loops = 2; /* minimize looping to fail fast in testing */
+ PG_EXPECT_SYS(rename(path, path2) == -1,
+ "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+ PG_EXPECT_EQ(errno, EACCES);
+ PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
+#else
+ PG_EXPECT_SYS(rename(path, path2) == 0,
+ "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+#endif
+ PG_EXPECT_SYS(close(fd) == 0);
Luckily the code in dirmod.c:pgrename() should retry lots of times if
a concurrent transient opener/reader comes along, so I think that
should be OK in practice (but if backups_r_us.exe holds the file open
for 10 seconds while we're trying to rename it, I assume we'll PANIC);
call that problem #1. What is slightly more disturbing is the clue in
the "Cygwin cleanup" thread[2] that rename() can fail to be 100%
atomic, so that a concurrent call to open() can fail with ENOENT (cf.
the POSIX requirement "... a link named new shall remain visible to
other processes throughout the renaming operation and refer either to
the file referred to by new or old ..."). Call that problem #2, a
problem that already causes us rare breakage (for example: could not
open file "pg_logical/snapshots/0-14FE6B0.snap").
I know that problem #1 can be fixed by applying v3-0004 from [1] but
that leads to impossible decisions like revoking support for non-NTFS
filesystems as discussed in that thread, and we certainly couldn't
back-patch that anyway. I assume problem #2 can too.
That makes me want to *also* acquire ControlFileLock, for base
backup's read of pg_control. Even though it seems redundant with the
rename() trick (the rename() trick should be enough for low-level
*and* basebackup on ext4), it would at least avoid the above
Windowsian pathologies during base backups.
I'm sorry for the patch/idea-churn in this thread. It's like
Whac-a-Mole. Blasted non-POSIX-compliant moles. New patches
attached. Are they getting better?
[1]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com
[2]: /messages/by-id/CA+hUKG+e13wK0PBX5Z63CCwWm7MfRQuwBRabM_3aKWSko2AUww@mail.gmail.com
Attachments:
v6-0001-Update-control-file-atomically-during-backups.patch (application/x-patch)
From caf6f0d49d5b24b8c3d1f41b97d7824dadad8f8b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Jul 2023 08:46:19 +1200
Subject: [PATCH v6 1/4] Update control file atomically during backups.
If there is a backup in progress, switch to a rename-based approach when
writing out the control file to avoid problems with torn reads that can
occur on some systems (at least ext4, ntfs). We don't want to make
UpdateControlFile() slower in general, because it's called frequently
during recovery, so we only do it while runningBackups > 0. In order to
activate the new behavior without races, we now also acquire
ControlFileLock when backups begin and end.
Back-patch to all supported releases.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/access/transam/xlog.c | 33 +++++++++++++++++++--
src/bin/pg_checksums/pg_checksums.c | 2 +-
src/bin/pg_resetwal/pg_resetwal.c | 2 +-
src/bin/pg_rewind/pg_rewind.c | 2 +-
src/common/controldata_utils.c | 41 ++++++++++++++++++++++++--
src/include/common/controldata_utils.h | 4 ++-
6 files changed, 74 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f7d4750fc0..c4626638e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,7 +442,9 @@ typedef struct XLogCtlInsert
/*
* runningBackups is a counter indicating the number of backups currently
* in progress. lastBackupStart is the latest checkpoint redo location
- * used as a starting point for an online backup.
+ * used as a starting point for an online backup. runningBackups is
+ * protected by both ControlFileLock and WAL insertion lock (in that
+ * order), because it affects the behavior of UpdateControlFile().
*/
int runningBackups;
XLogRecPtr lastBackupStart;
@@ -4170,7 +4172,13 @@ ReadControlFile(void)
static void
UpdateControlFile(void)
{
- update_controlfile(DataDir, ControlFile, true);
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ bool atomic;
+
+ Assert(LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE));
+ atomic = Insert->runningBackups > 0;
+
+ update_controlfile(DataDir, ControlFile, atomic, true);
}
/*
@@ -5306,9 +5314,12 @@ StartupXLOG(void)
* pg_control with any minimum recovery stop point obtained from a
* backup history file.
*
- * No need to hold ControlFileLock yet, we aren't up far enough.
+ * ControlFileLock not really needed yet, we aren't up far enough, but
+ * makes assertions simpler.
*/
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
UpdateControlFile();
+ LWLockRelease(ControlFileLock);
/*
* If there was a backup label file, it's done its job and the info
@@ -8293,6 +8304,12 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
memcpy(state->name, backupidstr, strlen(backupidstr));
+ /*
+ * UpdateControlFile() behaves differently while backups are running, and
+ * we need to avoid races when the backups start.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
/*
* Mark backup active in shared memory. We must do full-page WAL writes
* during an on-line backup even if not doing so at other times, because
@@ -8318,6 +8335,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
XLogCtl->Insert.runningBackups++;
WALInsertLockRelease();
+ LWLockRelease(ControlFileLock);
+
/*
* Ensure we decrement runningBackups if we fail below. NB -- for this to
* work correctly, it is critical that sessionBackupState is only updated
@@ -8608,6 +8627,12 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
errmsg("WAL level not sufficient for making an online backup"),
errhint("wal_level must be set to \"replica\" or \"logical\" at server start.")));
+ /*
+ * Interlocking with UpdateControlFile(), so that it can start using atomic
+ * mode while backups are running.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
/*
* OK to update backup counter and session-level lock.
*
@@ -8637,6 +8662,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
WALInsertLockRelease();
+ LWLockRelease(ControlFileLock);
+
/*
* If we are taking an online backup from the standby, we confirm that the
* standby has not been promoted during the backup.
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 19eb67e485..80411e2d9f 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -637,7 +637,7 @@ main(int argc, char *argv[])
}
pg_log_info("updating control file");
- update_controlfile(DataDir, ControlFile, do_sync);
+ update_controlfile(DataDir, ControlFile, false, do_sync);
if (verbose)
printf(_("Data checksum version: %u\n"), ControlFile->data_checksum_version);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e7ef2b8bd0..1dcc84abea 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -884,7 +884,7 @@ RewriteControlFile(void)
ControlFile.max_locks_per_xact = 64;
/* The control file gets flushed here. */
- update_controlfile(".", &ControlFile, true);
+ update_controlfile(".", &ControlFile, false, true);
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index f7f3b8227f..3a34b4d0dd 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -714,7 +714,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
ControlFile_new.minRecoveryPointTLI = endtli;
ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
if (!dry_run)
- update_controlfile(datadir_target, &ControlFile_new, do_sync);
+ update_controlfile(datadir_target, &ControlFile_new, false, do_sync);
}
static void
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..475c2e524c 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -140,14 +140,27 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
* optionally used to flush the updated control file. Note that it is up
* to the caller to properly lock ControlFileLock when calling this
* routine in the backend.
+ *
+ * If atomic is true, a slower procedure is used in backend code that renames a
+ * temporary file into place. This is intended for use while backups are in
+ * progress, to avoid the possibility of a torn read of the control file. Note
+ * that this refers to atomicity of concurrent reads and writes. (Atomicity
+ * under power failure is a separate topic; we always assume that writes of
+ * size PG_CONTROL_MAX_SAFE_SIZE are power-failure-atomic, even when 'atomic'
+ * is false here.)
*/
void
update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync)
+ ControlFileData *ControlFile,
+ bool atomic,
+ bool do_sync)
{
int fd;
char buffer[PG_CONTROL_FILE_SIZE];
char ControlFilePath[MAXPGPATH];
+#ifndef FRONTEND
+ int flags = 0;
+#endif
/* Update timestamp */
ControlFile->time = (pg_time_t) time(NULL);
@@ -167,7 +180,18 @@ update_controlfile(const char *DataDir,
memset(buffer, 0, PG_CONTROL_FILE_SIZE);
memcpy(buffer, ControlFile, sizeof(ControlFileData));
- snprintf(ControlFilePath, sizeof(ControlFilePath), "%s/%s", DataDir, XLOG_CONTROL_FILE);
+ /* In atomic mode, we'll write into a new temporary file first. */
+ snprintf(ControlFilePath,
+ sizeof(ControlFilePath),
+ "%s/%s%s",
+ DataDir,
+ XLOG_CONTROL_FILE,
+ atomic ? ".tmp" : "");
+#ifndef FRONTEND
+ flags = 0;
+ if (atomic)
+ flags = O_CREAT;
+#endif
#ifndef FRONTEND
@@ -175,7 +199,7 @@ update_controlfile(const char *DataDir,
* All errors issue a PANIC, so no need to use OpenTransientFile() and to
* worry about file descriptor leaks.
*/
- if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+ if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | flags)) < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
@@ -236,4 +260,15 @@ update_controlfile(const char *DataDir,
pg_fatal("could not close file \"%s\": %m", ControlFilePath);
#endif
}
+
+#ifndef FRONTEND
+ if (atomic)
+ {
+ char path[MAXPGPATH];
+
+ snprintf(path, sizeof(path), "%s/%s", DataDir, XLOG_CONTROL_FILE);
+
+ durable_rename(ControlFilePath, path, PANIC);
+ }
+#endif
}
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..0d9b41a2ba 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -14,6 +14,8 @@
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync);
+ ControlFileData *ControlFile,
+ bool atomic,
+ bool do_sync);
#endif /* COMMON_CONTROLDATA_UTILS_H */
--
2.39.2 (Apple Git-143)
v6-0002-Try-to-tolerate-torn-reads-of-control-file-in-fro.patch (application/x-patch)
From 69fd24b670b24424bc3da3e3c654750cf20826c4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH v6 2/4] Try to tolerate torn reads of control file in
frontend.
Some of our src/bin tools read the control file without any kind of
interlocking against concurrent writes.
Tolerate the torn read that can occur on some systems (ext4, ntfs) by
retrying if checksum fails, until we get two reads in a row with the
same checksum. This is only a last-ditch effort and not guaranteed to
reach the right conclusion with extremely unlucky scheduling, but it
seems at least very likely to. Thanks to Tom Lane for this suggestion.
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru> (earlier version)
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/common/controldata_utils.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 475c2e524c..9745a4c926 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,22 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
char ControlFilePath[MAXPGPATH];
pg_crc32c crc;
int r;
+#ifdef FRONTEND
+ pg_crc32c last_crc;
+ int retries = 0;
+#endif
Assert(crc_ok_p);
ControlFile = palloc_object(ControlFileData);
snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
+#ifdef FRONTEND
+ INIT_CRC32C(last_crc);
+
+retry:
+#endif
+
#ifndef FRONTEND
if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
ereport(ERROR,
@@ -117,6 +127,26 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
+#ifdef FRONTEND
+
+ /*
+ * With unlucky timing on filesystems that don't implement atomicity of
+ * concurrent reads and writes, we might have seen garbage if the server
+ * was writing to the file at the same time. Keep retrying until we see
+ * the same CRC twice, with a tiny sleep to give a concurrent writer a
+ * good chance of making progress.
+ */
+ if (!*crc_ok_p &&
+ (retries == 0 || !EQ_CRC32C(crc, last_crc)) &&
+ retries < 10)
+ {
+ retries++;
+ last_crc = crc;
+ pg_usleep(10000);
+ goto retry;
+ }
+#endif
+
/* Make sure the control file is valid byte order. */
if (ControlFile->pg_control_version % 65536 == 0 &&
ControlFile->pg_control_version / 65536 != 0)
--
2.39.2 (Apple Git-143)
v6-0003-Acquire-ControlFileLock-in-base-backups.patch (application/x-patch)
From 22c2acf613f62467efe3029a533d565d735864b2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v6 3/4] Acquire ControlFileLock in base backups.
Even though we now rename() the control file to gain atomicity during
backups, we can avoid some unpleasant retry behaviors or worse on
Windows by also acquiring ControlFileLock during base backups. (This is
similar to commit a2e97cb2.)
Back-patch to all supported releases.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/backup/basebackup.c | 36 ++++++++++++++++++---------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..f03496665f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
#include "backup/basebackup_target.h"
#include "commands/defrem.h"
#include "common/compression.h"
+#include "common/controldata_utils.h"
#include "common/file_perm.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
@@ -39,6 +40,7 @@
#include "storage/dsm_impl.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/reinit.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
static void sendFileWithContent(bbsink *sink, const char *filename,
- const char *content,
+ const char *content, int len,
backup_manifest_info *manifest);
static int64 _tarWriteHeader(bbsink *sink, const char *filename,
const char *linktarget, struct stat *statbuf,
@@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (ti->path == NULL)
{
- struct stat statbuf;
+ ControlFileData *control_file;
+ bool crc_ok;
bool sendtblspclinks = true;
char *backup_label;
@@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* In the main tar, include the backup_label first... */
backup_label = build_backup_content(backup_state, false);
sendFileWithContent(sink, BACKUP_LABEL_FILE,
- backup_label, &manifest);
+ backup_label, -1, &manifest);
pfree(backup_label);
/* Then the tablespace_map file, if required... */
if (opt->sendtblspcmapfile)
{
sendFileWithContent(sink, TABLESPACE_MAP,
- tablespace_map->data, &manifest);
+ tablespace_map->data, -1, &manifest);
sendtblspclinks = false;
}
@@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendtblspclinks, &manifest, NULL);
/* ... and pg_control after everything else. */
- if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not stat file \"%s\": %m",
- XLOG_CONTROL_FILE)));
- sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ control_file = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+ (char *) control_file, sizeof(*control_file),
+ &manifest);
+ pfree(control_file);
}
else
{
@@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
* complete segment.
*/
StatusFilePath(pathbuf, walFileName, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/*
@@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/* Properly terminate the tar file. */
@@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
static void
sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+ int len,
backup_manifest_info *manifest)
{
struct stat statbuf;
- int bytes_done = 0,
- len;
+ int bytes_done = 0;
pg_checksum_context checksum_ctx;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
filename);
- len = strlen(content);
+ if (len < 0)
+ len = strlen(content);
/*
* Construct a stat struct for the backup_label file we're injecting in
--
2.39.2 (Apple Git-143)
v6-0004-Acquire-ControlFileLock-in-SQL-functions.patch (application/x-patch)
From b7b5e85f0e4929441ab27402de5afd32392d9193 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 11:50:27 +1200
Subject: [PATCH v6 4/4] Acquire ControlFileLock in SQL functions.
Commit dc7d70ea added functions that read the control file, but didn't
acquire ControlFileLock. With unlucky timing and on certain file
systems, they could read corrupted data that is in the process of being
overwritten, and the checksum would fail.
Back-patch to all supported releases.
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/utils/misc/pg_controldata.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index f2c1084797..a1003a464d 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -24,6 +24,7 @@
#include "common/controldata_utils.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/builtins.h"
#include "utils/pg_lsn.h"
#include "utils/timestamp.h"
@@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* Read the control file. */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
/* read the control file */
+ LWLockAcquire(ControlFileLock, LW_SHARED);
ControlFile = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
if (!crc_ok)
ereport(ERROR,
(errmsg("calculated CRC checksum does not match value stored in file")));
--
2.39.2 (Apple Git-143)
Hi Thomas,
On 7/26/23 06:06, Thomas Munro wrote:
While chatting to Robert and Andres about all this, a new idea came
up. Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it. Unfortunately, it also suffers from Windows-specific
problems that I originally mentioned at the top of this thread but
had since repressed. Arrrghgh.
First, the good news:
We could write out a whole new control file, and durable_rename() it
into place. We don’t want to do that in general, because we don’t
want to slow down UpdateMinRecoveryPoint(). The new concept is to do
that only if a backup is in progress. That requires a bit of
interlocking with backup start/stop (ie when runningBackups is
changing in shmem, we don’t want to overlap with UpdateControlFile()'s
decision on how to do it). Here is a patch to try that out. No more
weasel wording needed for the docs; basebackup and low-level file
system backup should always see an atomic control file (and
occasionally also copy a harmless pg_control.tmp file). Then we only
need the gross retry-until-stable hack for front-end programs.
I like the approach in these patches better than the last patch set. My
only concern would be possible performance regression on standbys (when
doing backup from standby) since pg_control can be written very
frequently to update min recovery point.
I've made a first pass through the patches and they look generally
reasonable (and back patch-able).
One thing:
+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+ (char *) control_file, sizeof(*control_file),
+ &manifest);
I wonder if we should pad pg_control out to 8k so it remains the same
size as now? Postgres doesn't care, but might look odd to users, and is
arguably a change in behavior that should not be back patched.
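For illustration, here is a minimal standalone sketch of the padding idea, with toy names standing in for ControlFileData and PG_CONTROL_FILE_SIZE (this is not the patch's code, just the shape of it):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins: the real ControlFileData and
 * PG_CONTROL_FILE_SIZE live in pg_control.h. */
#define CONTROL_FILE_ON_DISK_SIZE 8192

typedef struct
{
    unsigned long system_identifier;
    unsigned int  pg_control_version;
} ToyControlFileData;

/* Zero-pad the struct out to the traditional on-disk size, so the
 * pg_control entry in the tar stream stays 8kB as it always has. */
static size_t
pad_control_data(const ToyControlFileData *cf, char *out, size_t outsz)
{
    assert(outsz >= CONTROL_FILE_ON_DISK_SIZE);
    memset(out, 0, CONTROL_FILE_ON_DISK_SIZE);
    memcpy(out, cf, sizeof(*cf));
    return CONTROL_FILE_ON_DISK_SIZE;
}
```

The sender would then pass the padded buffer and CONTROL_FILE_ON_DISK_SIZE instead of sizeof(*control_file).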
And the bad news:
Provided we can reasonably address the Windows issues this seems to be
the way to go.
Regards,
-David
Hello!
On 26.07.2023 07:06, Thomas Munro wrote:
New patches
attached. Are they getting better?
It seems to me that it is worth focusing efforts on the second part of the patch,
as it is the most in demand, and trying to commit it first.
And it seems there is a way to simplify it by adding a parameter to
get_controlfile() that will return the calculated CRC, and moving the retry
logic up a level.
There is a proposed algorithm in alg_level_up.pdf attached.
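For what it's worth, the retry-until-stable idea can be sketched standalone like this, with a toy checksum and a simulated torn read standing in for the real CRC-32C and file I/O (every name here is hypothetical scaffolding, not the proposed code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CTL_LEN 64

/* Toy checksum (FNV-1a) standing in for pg_control's CRC-32C. */
static uint32_t
toy_crc(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ buf[i]) * 16777619u;
    return h;
}

/* Simulated reader: the first read is "torn" (one byte flipped),
 * later reads are clean. */
static unsigned char fake_file[CTL_LEN] = "fake control file contents";
static int fake_reads = 0;

static void
fake_read(unsigned char *buf, uint32_t *stored_crc)
{
    memcpy(buf, fake_file, CTL_LEN);
    *stored_crc = toy_crc(fake_file, CTL_LEN);
    if (++fake_reads == 1)
        buf[0] ^= 0xFF;         /* torn read on the first attempt */
}

/* The moved-up retry logic: re-read until the computed CRC matches
 * the stored one, bounded by max_tries. */
static bool
read_until_crc_ok(unsigned char *buf, int max_tries)
{
    for (int i = 0; i < max_tries; i++)
    {
        uint32_t stored;
        fake_read(buf, &stored);
        if (toy_crc(buf, CTL_LEN) == stored)
            return true;
    }
    return false;
}
```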
[Excuse me, for at least three days I will be in a place where there is no available Internet,
so I will be able to read this thread no earlier than the evening of August 2.]
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Sorry, attached the wrong version of the file. Here is the right one.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'll abandon the others for now, since we're now thinking bigger [1]
for backups, side stepping the problem.
[1]: /messages/by-id/1330cb48-4e47-03ca-f2fb-b144b49514d8@pgmasters.net
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'll abandon the others for now, since we're now thinking bigger [1]
for backups, side stepping the problem.
FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead. 0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.
--
Michael
On 10/11/23 21:10, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'll abandon the others for now, since we're now thinking bigger [1]
for backups, side stepping the problem.
FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead. 0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.
Agreed on 0002 and 0004, though I don't really think a back patch of
0004 to 11 is necessary. I'd feel differently if there was a single
field report of this issue.
I would prefer to hold off on applying 0003 to HEAD until we see how [1]
pans out.
Having said that, I have a hard time seeing [1] as being something we
could back patch. The manipulation of backup_label is simple enough, but
starting a cluster without pg_control is definitely going to change some
things. Also, the requirement that backup software skip copying
pg_control after a minor release is not OK.
If we do back patch 0001, is 0003 really needed? Surely if 0001 works
with other backup software it would work fine for pg_basebackup.
Regards,
-David
[1]: /messages/by-id/1330cb48-4e47-03ca-f2fb-b144b49514d8@pgmasters.net
On 10/12/23 09:58, David Steele wrote:
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'll abandon the others for now, since we're now thinking bigger [1]
for backups, side stepping the problem.
FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead. 0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.
Agreed on 0002 and 0004, though I don't really think a back patch of
0004 to 11 is necessary. I'd feel differently if there was a single
field report of this issue.
I would prefer to hold off on applying 0003 to HEAD until we see how [1]
pans out.
Having said that, I have a hard time seeing [1] as being something we
could back patch. The manipulation of backup_label is simple enough, but
starting a cluster without pg_control is definitely going to change some
things. Also, the requirement that backup software skip copying
pg_control after a minor release is not OK.
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with
more advanced features (e.g. error on backup_label and pg_control both
present on initial cluster start) saved for HEAD.
Regards,
-David
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with more
advanced features (e.g. error on backup_label and pg_control both present on
initial cluster start) saved for HEAD.
I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.
Backward-compatibility is not much of a concern as long as the backend
is involved. The real problem here would be on the frontend side and
how much effort we should try to put in maintaining the code of
pg_basebackup compatible with older backends.
--
Michael
On 10/12/23 19:15, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with more
advanced features (e.g. error on backup_label and pg_control both present on
initial cluster start) saved for HEAD.I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.
I can't see why there would be any protocol changes, but perhaps I am
missing something?
One thing that does have to change, however, is the ordering of
backup_label in the base tar file. Right now it is at the beginning but
we need it to be at the end like pg_control is now.
I'm working up a POC patch now and hope to have something today or
tomorrow. I think it makes sense to at least have a look at an
alternative solution before going forward.
Regards,
-David
On 10/13/23 10:40, David Steele wrote:
On 10/12/23 19:15, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with more
advanced features (e.g. error on backup_label and pg_control both present on
initial cluster start) saved for HEAD.
I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.
I can't see why there would be any protocol changes, but perhaps I am
missing something?
One thing that does have to change, however, is the ordering of
backup_label in the base tar file. Right now it is at the beginning but
we need it to be at the end like pg_control is now.
Well, no protocol changes, but overall this does not seem like a
direction that would be even remotely back patch-able. See [1] for details.
For back branches that puts us back to committing some form of 0001 and
0003. I'm still worried about the performance implications of 0001 on a
standby when in backup mode, but I don't have any better ideas.
If we do commit 0001 and 0003 to the back branches I'd still like to
hold off on HEAD to see if we can do something better there.
Regards,
-David
[1]: /messages/by-id/e05f20f9-6f91-9a70-efab-9a2ae472e65d@pgmasters.net
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday. That leaves the
backup ones, which I've rebased and attached, no change. It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these (two birds with one stone), so
I'll just leave these here to keep the cfbot happy in the meantime.
Attachments:
v7-0001-Update-control-file-atomically-during-backups.patch (text/x-patch)
From 3ab478ba3bb1712e6fc529f4bdaf48b58a6d72c5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Jul 2023 08:46:19 +1200
Subject: [PATCH v7 1/2] Update control file atomically during backups.
If there is a backup in progress, switch to a rename-based approach when
writing out the control file to avoid problems with torn reads that can
occur on some systems (at least ext4, ntfs). We don't want to make
UpdateControlFile() slower in general, because it's called frequently
during recovery, so we only do it while runningBackups > 0. In order to
activate the new behavior without races, we now also acquire
ControlFileLock when backups begin and end.
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Reviewed-by: David Steele <david@pgmasters.net>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/access/transam/xlog.c | 33 +++++++++++++++++++--
src/bin/pg_checksums/pg_checksums.c | 2 +-
src/bin/pg_resetwal/pg_resetwal.c | 2 +-
src/bin/pg_rewind/pg_rewind.c | 2 +-
src/common/controldata_utils.c | 41 ++++++++++++++++++++++++--
src/include/common/controldata_utils.h | 4 ++-
6 files changed, 74 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..c82e5c6ad4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,7 +442,9 @@ typedef struct XLogCtlInsert
/*
* runningBackups is a counter indicating the number of backups currently
* in progress. lastBackupStart is the latest checkpoint redo location
- * used as a starting point for an online backup.
+ * used as a starting point for an online backup. runningBackups is
+ * protected by both ControlFileLock and WAL insertion lock (in that
+ * order), because it affects the behavior of UpdateControlFile().
*/
int runningBackups;
XLogRecPtr lastBackupStart;
@@ -4208,7 +4210,13 @@ ReadControlFile(void)
static void
UpdateControlFile(void)
{
- update_controlfile(DataDir, ControlFile, true);
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ bool atomic;
+
+ Assert(LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE));
+ atomic = Insert->runningBackups > 0;
+
+ update_controlfile(DataDir, ControlFile, atomic, true);
}
/*
@@ -5344,9 +5352,12 @@ StartupXLOG(void)
* pg_control with any minimum recovery stop point obtained from a
* backup history file.
*
- * No need to hold ControlFileLock yet, we aren't up far enough.
+ * ControlFileLock not really needed yet, we aren't up far enough, but
+ * makes assertions simpler.
*/
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
UpdateControlFile();
+ LWLockRelease(ControlFileLock);
/*
* If there was a backup label file, it's done its job and the info
@@ -8335,6 +8346,12 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
memcpy(state->name, backupidstr, strlen(backupidstr));
+ /*
+ * UpdateControlFile() behaves differently while backups are running, and
+ * we need to avoid races when the backups start.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
/*
* Mark backup active in shared memory. We must do full-page WAL writes
* during an on-line backup even if not doing so at other times, because
@@ -8360,6 +8377,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
XLogCtl->Insert.runningBackups++;
WALInsertLockRelease();
+ LWLockRelease(ControlFileLock);
+
/*
* Ensure we decrement runningBackups if we fail below. NB -- for this to
* work correctly, it is critical that sessionBackupState is only updated
@@ -8650,6 +8669,12 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
errmsg("WAL level not sufficient for making an online backup"),
errhint("wal_level must be set to \"replica\" or \"logical\" at server start.")));
+ /*
+ * Interlocking with UpdateControlFile(), so that it can start using atomic
+ * mode while backups are running.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
/*
* OK to update backup counter and session-level lock.
*
@@ -8679,6 +8704,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
WALInsertLockRelease();
+ LWLockRelease(ControlFileLock);
+
/*
* If we are taking an online backup from the standby, we confirm that the
* standby has not been promoted during the backup.
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index e009ba5e0b..4dab08b4a0 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -634,7 +634,7 @@ main(int argc, char *argv[])
}
pg_log_info("updating control file");
- update_controlfile(DataDir, ControlFile, do_sync);
+ update_controlfile(DataDir, ControlFile, false, do_sync);
if (verbose)
printf(_("Data checksum version: %u\n"), ControlFile->data_checksum_version);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..a9142a213d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -889,7 +889,7 @@ RewriteControlFile(void)
ControlFile.max_locks_per_xact = 64;
/* The control file gets flushed here. */
- update_controlfile(".", &ControlFile, true);
+ update_controlfile(".", &ControlFile, false, true);
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bfd44a284e..6b0531f4d1 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -723,7 +723,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
ControlFile_new.minRecoveryPointTLI = endtli;
ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
if (!dry_run)
- update_controlfile(datadir_target, &ControlFile_new, do_sync);
+ update_controlfile(datadir_target, &ControlFile_new, false, do_sync);
}
static void
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 4d1cd1ce98..8f2587661a 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -170,14 +170,27 @@ retry:
* optionally used to flush the updated control file. Note that it is up
* to the caller to properly lock ControlFileLock when calling this
* routine in the backend.
+ *
+ * If atomic is true, a slower procedure is used in backend code that renames a
+ * temporary file into place. This is intended for use while backups are in
+ * progress, to avoid the possibility of a torn read of the control file. Note
+ * that this refers to atomicity of concurrent reads and writes. (Atomicity
+ * under power failure is a separate topic; we always assume that writes of
+ * size PG_CONTROL_MAX_SAFE_SIZE are power-failure-atomic, even when 'atomic'
+ * is false here.)
*/
void
update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync)
+ ControlFileData *ControlFile,
+ bool atomic,
+ bool do_sync)
{
int fd;
char buffer[PG_CONTROL_FILE_SIZE];
char ControlFilePath[MAXPGPATH];
+#ifndef FRONTEND
+ int flags = 0;
+#endif
/* Update timestamp */
ControlFile->time = (pg_time_t) time(NULL);
@@ -197,7 +210,18 @@ update_controlfile(const char *DataDir,
memset(buffer, 0, PG_CONTROL_FILE_SIZE);
memcpy(buffer, ControlFile, sizeof(ControlFileData));
- snprintf(ControlFilePath, sizeof(ControlFilePath), "%s/%s", DataDir, XLOG_CONTROL_FILE);
+ /* In atomic mode, we'll write into a new temporary file first. */
+ snprintf(ControlFilePath,
+ sizeof(ControlFilePath),
+ "%s/%s%s",
+ DataDir,
+ XLOG_CONTROL_FILE,
+ atomic ? ".tmp" : "");
+#ifndef FRONTEND
+ flags = 0;
+ if (atomic)
+ flags = O_CREAT;
+#endif
#ifndef FRONTEND
@@ -205,7 +229,7 @@ update_controlfile(const char *DataDir,
* All errors issue a PANIC, so no need to use OpenTransientFile() and to
* worry about file descriptor leaks.
*/
- if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+ if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | flags)) < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
@@ -266,4 +290,15 @@ update_controlfile(const char *DataDir,
pg_fatal("could not close file \"%s\": %m", ControlFilePath);
#endif
}
+
+#ifndef FRONTEND
+ if (atomic)
+ {
+ char path[MAXPGPATH];
+
+ snprintf(path, sizeof(path), "%s/%s", DataDir, XLOG_CONTROL_FILE);
+
+ durable_rename(ControlFilePath, path, PANIC);
+ }
+#endif
}
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..0d9b41a2ba 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -14,6 +14,8 @@
extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
extern void update_controlfile(const char *DataDir,
- ControlFileData *ControlFile, bool do_sync);
+ ControlFileData *ControlFile,
+ bool atomic,
+ bool do_sync);
#endif /* COMMON_CONTROLDATA_UTILS_H */
--
2.42.0
v7-0002-Acquire-ControlFileLock-in-base-backups.patch (text/x-patch)
From 800f5c290cc73c5b461d44c66d4d9ebe830bd08f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v7 2/2] Acquire ControlFileLock in base backups.
Even though we now rename() the control file to gain atomicity during
backups, we can avoid some unpleasant retry behaviors and problems
observed with the atomicity of rename() on Windows if we also acquire
ControlFileLock to read the file during base backups.
Back-patch to all supported releases.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Reviewed-by: David Steele <david@pgmasters.net>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
src/backend/backup/basebackup.c | 36 ++++++++++++++++++---------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..93a8949de1 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
#include "backup/basebackup_target.h"
#include "commands/defrem.h"
#include "common/compression.h"
+#include "common/controldata_utils.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "lib/stringinfo.h"
@@ -39,6 +40,7 @@
#include "storage/checksum.h"
#include "storage/dsm_impl.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/reinit.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -93,7 +95,7 @@ static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
static void sendFileWithContent(bbsink *sink, const char *filename,
- const char *content,
+ const char *content, int len,
backup_manifest_info *manifest);
static int64 _tarWriteHeader(bbsink *sink, const char *filename,
const char *linktarget, struct stat *statbuf,
@@ -324,7 +326,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (ti->path == NULL)
{
- struct stat statbuf;
+ ControlFileData *control_file;
+ bool crc_ok;
bool sendtblspclinks = true;
char *backup_label;
@@ -333,14 +336,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* In the main tar, include the backup_label first... */
backup_label = build_backup_content(backup_state, false);
sendFileWithContent(sink, BACKUP_LABEL_FILE,
- backup_label, &manifest);
+ backup_label, -1, &manifest);
pfree(backup_label);
/* Then the tablespace_map file, if required... */
if (opt->sendtblspcmapfile)
{
sendFileWithContent(sink, TABLESPACE_MAP,
- tablespace_map->data, &manifest);
+ tablespace_map->data, -1, &manifest);
sendtblspclinks = false;
}
@@ -349,13 +352,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendtblspclinks, &manifest, NULL);
/* ... and pg_control after everything else. */
- if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not stat file \"%s\": %m",
- XLOG_CONTROL_FILE)));
- sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ control_file = get_controlfile(DataDir, &crc_ok);
+ LWLockRelease(ControlFileLock);
+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+ (char *) control_file, sizeof(*control_file),
+ &manifest);
+ pfree(control_file);
}
else
{
@@ -600,7 +603,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
* complete segment.
*/
StatusFilePath(pathbuf, walFileName, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/*
@@ -628,7 +631,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(sink, pathbuf, "", &manifest);
+ sendFileWithContent(sink, pathbuf, "", -1, &manifest);
}
/* Properly terminate the tar file. */
@@ -1039,18 +1042,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
static void
sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+ int len,
backup_manifest_info *manifest)
{
struct stat statbuf;
- int bytes_done = 0,
- len;
+ int bytes_done = 0;
pg_checksum_context checksum_ctx;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
filename);
- len = strlen(content);
+ if (len < 0)
+ len = strlen(content);
/*
* Construct a stat struct for the backup_label file we're injecting in
--
2.42.0
On Mon, Oct 16, 2023 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday. That leaves the
backup ones, which I've rebased and attached, no change. It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these (two birds with one stone), so
I'll just leave these here to keep the cfbot happy in the meantime.
0002 has no comments at all, and the commit message is not specific
enough for me to understand what problem it fixes. I suggest adding
some comments and fixing the commit message. I'm also not very sure
whether the change to the signature of sendFileWithContent is really
the best way to deal with the control file maybe containing a zero
byte ... but I'm quite sure that if we're going to do it that way, it
needs a comment. But maybe we should do something else that would
require less explanation, like having the caller always pass the
length.
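The always-pass-the-length convention can be illustrated with a toy stand-in (hypothetical, not the actual sendFileWithContent): binary payloads that may contain NUL bytes and text payloads then go through exactly the same rule.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for sendFileWithContent: the caller always supplies an
 * explicit length, so there is no strlen() hidden inside that would
 * truncate binary content at the first NUL byte. */
static size_t
send_with_content(const char *filename, const char *content, size_t len)
{
    (void) filename;
    (void) content;
    return len;                 /* pretend we streamed len bytes */
}
```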
Regarding 0001, the way you've hacked up update_controlfile() doesn't
fill me with joy. It's nice if code that is common to the frontend and
the backend does the same thing in both cases rather than, say, having
an extra argument that only works in one case but not the other. I bet
this could be refactored to make it nicer, e.g. have one function that
takes an exact pathname at which the control file is to be written and
then other functions that use it as a subroutine.
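A rough standalone sketch of that refactoring shape, using plain stdio in place of the real BasicOpenFile/durable_rename machinery (names are hypothetical and fsync is omitted for brevity):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Core routine in the spirit of the suggested split: write the buffer
 * to one exact path. (The real code would also flush and fsync.) */
static int
write_at_path(const char *path, const char *buf, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(buf, 1, len, f);
    if (fclose(f) != 0)
        return -1;
    return (n == len) ? 0 : -1;
}

/* Wrapper for the atomic case: write "<path>.tmp", then rename it into
 * place, so concurrent readers only ever see a complete file. */
static int
write_atomic(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (write_at_path(tmp, buf, len) != 0)
        return -1;
    return rename(tmp, path);
}
```

A non-atomic wrapper would simply call write_at_path() on the final path, keeping the common code identical for frontend and backend.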
Personally, I think the general idea of 0001 is better than any
competing proposal on the table. In the case of pg_basebackup, we
could fix the server to perform appropriate locking around reading the
control file, so that the version sent to the client doesn't get torn.
But if a backup is made by calling pg_backup_start() and copying
$PGDATA, that isn't going to work. To fix that, we need to either make
the backup procedure more complicated and essentially treat the
control file as a special case, or we need to do something like this.
I think this is better; but as you mention, opinions vary on that.
Life would be a lot easier here if we could get rid of the low-level
backup API and just have pg_basebackup DTRT, but that seems like a
completely non-viable proposal.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 17, 2023 at 10:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
Life would be a lot easier here if we could get rid of the low-level
backup API and just have pg_basebackup DTRT, but that seems like a
completely non-viable proposal.
Yeah, my contribution to this area [1] is focusing on the API because I
figured we've provided it and should do our best to have it do as much as
possible for the dba or third-parties that build tooling on top of it.
I kinda think that adding a pg_backup_metadata directory that
pg_backup_start|stop can use may help here. I'm wondering whether those
functions provide enough control guarantees that pg_control's
"in_backup=true|false" flag proposed in that thread is reliable enough when
copied to the root directory in the backup. I kinda feel that so long as
the flag is reliable it should be possible for the signal file processing
code to implement whatever protocol we need.
David J.
[1]: /messages/by-id/CAKFQuwbpz4s8XP_+Khsif2eFaC78wpTbNbevUYBmjq-UCeNL7Q@mail.gmail.com
On Tue, 17 Oct 2023 at 04:18, Thomas Munro <thomas.munro@gmail.com> wrote:
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday. That leaves the
backup ones, which I've rebased and attached, no change. It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these (two birds with one stone), so
I'll just leave these here to keep the cfbot happy in the meantime.
I have changed the status of this to "Waiting on Author" as Robert's
comments have not yet been handled. Feel free to post an updated
version and change the status accordingly.
Regards,
Vignesh
On Thu, 11 Jan 2024 at 19:50, vignesh C <vignesh21@gmail.com> wrote:
On Tue, 17 Oct 2023 at 04:18, Thomas Munro <thomas.munro@gmail.com> wrote:
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday. That leaves the
backup ones, which I've rebased and attached, no change. It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these (two birds with one stone), so
I'll just leave these here to keep the cfbot happy in the meantime.
I have changed the status of this to "Waiting on Author" as Robert's
comments have not yet been handled. Feel free to post an updated
version and change the status accordingly.
The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.
Regards,
Vignesh