race condition when writing pg_control
Hi hackers,
I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:
FATAL: incorrect checksum in control file
This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
Nathan
Attachments:
v1-0001-Prevent-race-condition-when-writing-pg_control.patch
From 56d1fa121f77ea7a0b5173b546877c76393f94ab Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 4 May 2020 17:38:46 +0000
Subject: [PATCH v1 1/1] Prevent race condition when writing pg_control.
---
src/backend/access/transam/xlog.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 065eb275b1..cf1e9ca006 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9971,7 +9971,9 @@ xlog_redo(XLogReaderState *record)
}
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
@@ -10028,7 +10030,9 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
--
2.16.6
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:
FATAL: incorrect checksum in control file
This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.
On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:
FATAL: incorrect checksum in control file
This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.
Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.
Attachments:
0001-Fix-race-condition-that-could-corrupt-pg_control.patch
From 28ee04b537085148c75f575d2bc755a1fffc19c7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 22 May 2020 16:23:59 +1200
Subject: [PATCH] Fix race condition that could corrupt pg_control.
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
ControlFileLock before modifying ControlFile->checkPointCopy, or the
checkpointer could write out a control file with a bad checksum.
Back-patch to all supported releases.
Author: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
---
src/backend/access/transam/xlog.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..274b808476 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9976,7 +9976,9 @@ xlog_redo(XLogReaderState *record)
}
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
@@ -10033,7 +10035,9 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
--
2.20.1
On 2020/05/22 13:51, Thomas Munro wrote:
On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:
FATAL: incorrect checksum in control file
This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.
Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.
+1 to push the patch.
Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:
Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?
Yeah. It also checks the control file values, implying that we should
have LW_SHARED taken at least at the beginning, but this lock cannot
be upgraded, so we need LW_EXCLUSIVE the whole time. I am wondering if we
should check with an assert if ControlFileLock is taken when going
through UpdateControlFile(). We have one code path at the beginning
of redo where we don't need a lock close to the backup_label file
checks, but we could just pass down a boolean flag to the routine to
handle that case. Another good thing in having an assert is that any
new caller of UpdateControlFile() would need to think about the need
of a lock.
--
Michael
On 5/21/20, 9:52 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:
Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.
Looks good to me. Thanks!
Nathan
On 5/22/20, 10:40 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:
Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?
Yeah. It also checks the control file values, implying that we should
have LW_SHARED taken at least at the beginning, but this lock cannot
be upgraded, so we need LW_EXCLUSIVE the whole time. I am wondering if we
should check with an assert if ControlFileLock is taken when going
through UpdateControlFile(). We have one code path at the beginning
of redo where we don't need a lock close to the backup_label file
checks, but we could just pass down a boolean flag to the routine to
handle that case. Another good thing in having an assert is that any
new caller of UpdateControlFile() would need to think about the need
of a lock.
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.
For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.
Nathan
Attachments:
v1-0001-Assert-that-ControlFileLock-is-held-exclusively-i.patch
From 99341ba250aae7186e3a974d2d5fa035ac0dc625 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Tue, 26 May 2020 19:26:45 +0000
Subject: [PATCH v1 1/1] Assert that ControlFileLock is held exclusively in
UpdateControlFile().
---
src/backend/access/transam/xlog.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..f71f6a287b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4900,6 +4900,7 @@ ReadControlFile(void)
void
UpdateControlFile(void)
{
+ Assert(LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE));
update_controlfile(DataDir, ControlFile, true);
}
@@ -6955,8 +6956,14 @@ StartupXLOG(void)
}
}
ControlFile->time = (pg_time_t) time(NULL);
- /* No need to hold ControlFileLock yet, we aren't up far enough */
+
+ /*
+ * We aren't up far enough yet to need the ControlFileLock, but we take
+ * it anyway to satisfy the assertion in UpdateControlFile().
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
UpdateControlFile();
+ LWLockRelease(ControlFileLock);
/*
* Initialize our local copy of minRecoveryPoint. When doing crash
@@ -9743,6 +9750,8 @@ XLogReportParameters(void)
XLogFlush(recptr);
}
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9752,6 +9761,8 @@ XLogReportParameters(void)
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
UpdateControlFile();
+
+ LWLockRelease(ControlFileLock);
}
}
--
2.16.6
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.
Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.
For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.
They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.
--
Michael
On 2020/05/27 16:10, Michael Paquier wrote:
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.
Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.
I have no strong opinion about this, but I tend to agree with Michael here.
For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.
They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.
LGTM.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 5/29/20, 12:24 AM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
On 2020/05/27 16:10, Michael Paquier wrote:
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.
Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.
I have no strong opinion about this, but I tend to agree with Michael here.
For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.
They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.
LGTM.
Thanks for the feedback. I've attached a new set of patches.
Nathan
Attachments:
0003-Assert-that-ControlFileLock-is-held-within-UpdateCon.patch
From edc97ec1728c2d05f5d6a0bc6b3a7d665b4b8e6e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 31 May 2020 20:48:26 +0000
Subject: [PATCH 3/3] Assert that ControlFileLock is held within
UpdateControlFile().
Most callers of UpdateControlFile() should acquire ControlFileLock
exclusively beforehand. This change adds an assertion that the
lock is held. Callers that do not require the lock can pass a
boolean flag to bypass the assertion.
Back-patch to all supported releases.
Author: Nathan Bossart, based on a suggestion from Michael Paquier
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
---
src/backend/access/transam/xlog.c | 34 ++++++++++++++++++++--------------
src/include/access/xlog.h | 2 +-
2 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..fe23515e81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2820,7 +2820,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
ControlFile->minRecoveryPoint = newMinRecoveryPoint;
ControlFile->minRecoveryPointTLI = newMinRecoveryPointTLI;
- UpdateControlFile();
+ UpdateControlFile(true);
minRecoveryPoint = newMinRecoveryPoint;
minRecoveryPointTLI = newMinRecoveryPointTLI;
@@ -4434,7 +4434,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
*/
updateMinRecoveryPoint = true;
- UpdateControlFile();
+ UpdateControlFile(true);
/*
* We update SharedRecoveryState while holding the lock on
@@ -4896,10 +4896,16 @@ ReadControlFile(void)
/*
* Utility wrapper to update the control file. Note that the control
* file gets flushed.
+ *
+ * Most callers should acquire ControlFileLock exclusively before calling
+ * this function. Callers that can safely use this function without holding
+ * the lock should set holds_lock to false. Otherwise, holds_lock should be
+ * set to true.
*/
void
-UpdateControlFile(void)
+UpdateControlFile(bool holds_lock)
{
+ Assert(!holds_lock || LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE));
update_controlfile(DataDir, ControlFile, true);
}
@@ -6956,7 +6962,7 @@ StartupXLOG(void)
}
ControlFile->time = (pg_time_t) time(NULL);
/* No need to hold ControlFileLock yet, we aren't up far enough */
- UpdateControlFile();
+ UpdateControlFile(false);
/*
* Initialize our local copy of minRecoveryPoint. When doing crash
@@ -7940,7 +7946,7 @@ StartupXLOG(void)
XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
SpinLockRelease(&XLogCtl->info_lck);
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
/*
@@ -8007,7 +8013,7 @@ CheckRecoveryConsistency(void)
ControlFile->backupStartPoint = InvalidXLogRecPtr;
ControlFile->backupEndPoint = InvalidXLogRecPtr;
ControlFile->backupEndRequired = false;
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
}
@@ -8762,7 +8768,7 @@ CreateCheckPoint(int flags)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = (pg_time_t) time(NULL);
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
}
@@ -9044,7 +9050,7 @@ CreateCheckPoint(int flags)
ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
SpinLockRelease(&XLogCtl->ulsn_lck);
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
@@ -9153,7 +9159,7 @@ CreateEndOfRecoveryRecord(void)
ControlFile->time = (pg_time_t) time(NULL);
ControlFile->minRecoveryPoint = recptr;
ControlFile->minRecoveryPointTLI = ThisTimeLineID;
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
END_CRIT_SECTION();
@@ -9304,7 +9310,7 @@ CreateRestartPoint(int flags)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
ControlFile->time = (pg_time_t) time(NULL);
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
}
LWLockRelease(CheckpointLock);
@@ -9386,7 +9392,7 @@ CreateRestartPoint(int flags)
}
if (flags & CHECKPOINT_IS_SHUTDOWN)
ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
- UpdateControlFile();
+ UpdateControlFile(true);
}
LWLockRelease(ControlFileLock);
@@ -9753,7 +9759,7 @@ XLogReportParameters(void)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
}
@@ -10142,7 +10148,7 @@ xlog_redo(XLogReaderState *record)
}
ControlFile->backupStartPoint = InvalidXLogRecPtr;
ControlFile->backupEndRequired = false;
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
}
@@ -10186,7 +10192,7 @@ xlog_redo(XLogReaderState *record)
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
- UpdateControlFile();
+ UpdateControlFile(true);
LWLockRelease(ControlFileLock);
/* Check to see if any parameter change gives a problem on recovery */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..bf1392e6e5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,7 +312,7 @@ extern void SetRecoveryPause(bool recoveryPause);
extern TimestampTz GetLatestXTime(void);
extern TimestampTz GetCurrentChunkReplayStartTime(void);
-extern void UpdateControlFile(void);
+extern void UpdateControlFile(bool holds_lock);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
--
2.16.6
0002-Acquire-ControlFileLock-within-XLogReportParameters.patch
From cd4bd8db0b73b445d061f38a09cf377083666062 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 31 May 2020 19:47:36 +0000
Subject: [PATCH 2/3] Acquire ControlFileLock within XLogReportParameters().
XLogReportParameters() must acquire ControlFileLock before
modifying ControlFile and calling UpdateControlFile(), or the
checkpointer could write out a control file with a bad checksum.
Back-patch to all supported releases.
Reported-by: Fujii Masao
Author: Nathan Bossart
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
---
src/backend/access/transam/xlog.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 274b808476..55cac186dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9743,6 +9743,8 @@ XLogReportParameters(void)
XLogFlush(recptr);
}
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9752,6 +9754,8 @@ XLogReportParameters(void)
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
UpdateControlFile();
+
+ LWLockRelease(ControlFileLock);
}
}
--
2.16.6
0001-Fix-race-condition-that-could-corrupt-pg_control.patch
From 824d2b622eae134cec8f732a29bac69c83f03484 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 22 May 2020 16:23:59 +1200
Subject: [PATCH 1/3] Fix race condition that could corrupt pg_control.
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
ControlFileLock before modifying ControlFile->checkPointCopy, or the
checkpointer could write out a control file with a bad checksum.
Back-patch to all supported releases.
Author: Nathan Bossart
Reviewed-by: Fujii Masao, Thomas Munro
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
---
src/backend/access/transam/xlog.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..274b808476 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9976,7 +9976,9 @@ xlog_redo(XLogReaderState *record)
}
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
@@ -10033,7 +10035,9 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
+ LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
--
2.16.6
On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:
Thanks for the feedback. I've attached a new set of patches.
Thanks for splitting the set. 0001 and 0002 are the minimum set for
back-patching, and it would be better to merge them together. 0003 is
debatable and not an actual bug fix, so I would refrain from doing a
backpatch. It does not seem that there is a strong consensus in favor
of 0003 either.
Thomas, are you planning to look at this patch set?
--
Michael
On Tue, Jun 2, 2020 at 5:24 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:
Thanks for the feedback. I've attached a new set of patches.
Thanks for splitting the set. 0001 and 0002 are the minimum set for
back-patching, and it would be better to merge them together. 0003 is
debatable and not an actual bug fix, so I would refrain from doing a
backpatch. It does not seem that there is a strong consensus in favor
of 0003 either.
Thomas, are you planning to look at this patch set?
Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:
Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.
Thanks!
--
Michael
On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:
Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.
Thanks!
I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?
On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:
I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?
Thanks! The approach for 0003 was discussed a bit upthread [0]. I do
not have a strong opinion, but I lean towards just acquiring the lock.
Nathan
[0]: /messages/by-id/20200527071053.GD103662@paquier.xyz
On Mon, Jun 08, 2020 at 03:25:31AM +0000, Bossart, Nathan wrote:
On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:
I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?
Thanks! The approach for 0003 was discussed a bit upthread [0]. I do
not have a strong opinion, but I lean towards just acquiring the lock.
Fujii-san has provided an answer upthread that can maybe be translated
as a +0.3~0.4:
/messages/by-id/fc796148-7d63-47bb-e91d-e09b62a502e9@oss.nttdata.com
FWIW, I'd rather not take the lock as that's not necessary and just
add the parameter if I were to do it. Now I would be fine as well to
just take the lock if you decide that's more simple, as long as we add
this new assertion as a safety net for future changes.
--
Michael
On Fri, May 29, 2020 at 12:54 PM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:
On 2020/05/27 16:10, Michael Paquier wrote:
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.
Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.
I have no strong opinion about this, but I tend to agree with Michael here.
I don't have a strong opinion about this either, but I like Nathan's
approach more: just take the lock in the startup process as well, for
simplicity, if that is not hurting much. Apart from the startup process
we have to take the lock to update the control file anyway, so having
separate treatment for the startup process looks confusing to me, IMHO.
Regards,
Amul
On Sun, Jun 7, 2020 at 10:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:
Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.
Thanks!
I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?
Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl
fail and saw this message in the replica log [2]:
2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect
checksum in control file
I'm pretty sure it's not related to my commit. So, I was looking for
existing reports of this error message.
It's a long shot, since 0001 and 0002 were already pushed, but this is
the only recent report I could find of "FATAL: incorrect checksum in
control file" in pgsql-hackers or bugs archives.
I do see this thread from 2016 [3] which might be relevant because the
reported bug was also on Windows.
- Melanie
[1]: https://cirrus-ci.com/task/4626725689098240
[2]: https://api.cirrus-ci.com/v1/artifact/task/4626725689098240/testrun/build/testrun/recovery/018_wal_optimize/log/018_wal_optimize_node_replica.log
[3]: /messages/by-id/CAEepm=0hh_Dvd2Q+fcjYpkVzSoNX2+f167cYu5nwu=qh5HZhJw@mail.gmail.com
On Thu, May 16, 2024 at 12:19:22PM -0400, Melanie Plageman wrote:
Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl
fail and saw this message in the replica log [2]:
2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect
checksum in control file
I'm pretty sure it's not related to my commit. So, I was looking for
existing reports of this error message.
Yeah, I don't see how it could be related.
It's a long shot, since 0001 and 0002 were already pushed, but this is
the only recent report I could find of "FATAL: incorrect checksum in
control file" in pgsql-hackers or bugs archives.
I do see this thread from 2016 [3] which might be relevant because the
reported bug was also on Windows.
I suspect it will be difficult to investigate this one too much further
unless we can track down a copy of the control file with the bad checksum.
Other than searching for any new code that isn't doing the appropriate
locking, maybe we could search the buildfarm for any other occurrences. I
also see some threads concerning whether the way we are reading/writing
the control file is atomic.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
I suspect it will be difficult to investigate this one too much further
unless we can track down a copy of the control file with the bad checksum.
Other than searching for any new code that isn't doing the appropriate
locking, maybe we could search the buildfarm for any other occurrences. I
also see some threads concerning whether the way we are reading/writing
the control file is atomic.
The intention was certainly always that it be atomic. If it isn't
we have got *big* trouble.
regards, tom lane
Hi,
On 2024-05-16 14:50:50 -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
I suspect it will be difficult to investigate this one too much further
unless we can track down a copy of the control file with the bad checksum.
Other than searching for any new code that isn't doing the appropriate
locking, maybe we could search the buildfarm for any other occurrences. I
also see some threads concerning whether the way we are reading/writing
the control file is atomic.
The intention was certainly always that it be atomic. If it isn't
we have got *big* trouble.
We unfortunately do *know* that on several systems e.g. basebackup can read a
partially written control file, while the control file is being
updated. Thomas addressed this partially for frontend code, but not yet for
backend code. See
/messages/by-id/CA+hUKGLhLGCV67NuTiE=etdcw5ChMkYgpgFsa9PtrXm-984FYA@mail.gmail.com
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2024-05-16 14:50:50 -0400, Tom Lane wrote:
The intention was certainly always that it be atomic. If it isn't
we have got *big* trouble.
We unfortunately do *know* that on several systems e.g. basebackup can read a
partially written control file, while the control file is being
updated.
Yeah, but can't we just retry that if we get a bad checksum?
What had better be atomic is the write to disk. Systems that can't
manage POSIX semantics for concurrent reads and writes are annoying,
but not fatal ...
regards, tom lane
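For illustration, a retry along the lines Tom suggests might look roughly like
the sketch below. This is hypothetical code, not PostgreSQL source or a patch
from this thread; it assumes the on-disk write of pg_control is atomic, so a
CRC mismatch on read just means the reader raced with a concurrent writer.

#include "postgres.h"

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#include "catalog/pg_control.h"
#include "port/pg_crc32c.h"

static bool
read_controlfile_with_retry(const char *path, ControlFileData *out)
{
	for (int attempt = 0; attempt < 3; attempt++)
	{
		pg_crc32c	crc;
		int			fd = open(path, O_RDONLY);
		ssize_t		r;

		if (fd < 0)
			return false;
		r = read(fd, out, sizeof(ControlFileData));
		close(fd);
		if (r != (ssize_t) sizeof(ControlFileData))
			return false;

		/* The stored CRC covers everything up to the crc field itself. */
		INIT_CRC32C(crc);
		COMP_CRC32C(crc, (char *) out, offsetof(ControlFileData, crc));
		FIN_CRC32C(crc);
		if (EQ_CRC32C(crc, out->crc))
			return true;		/* consistent copy */

		/* Mismatch: assume we read a torn copy and simply try again. */
	}
	return false;				/* persistent mismatch: likely real corruption */
}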
Hi,
On 2024-05-16 15:01:31 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2024-05-16 14:50:50 -0400, Tom Lane wrote:
The intention was certainly always that it be atomic. If it isn't
we have got *big* trouble.We unfortunately do *know* that on several systems e.g. basebackup can read a
partially written control file, while the control file is being
updated.
Yeah, but can't we just retry that if we get a bad checksum?
Retry what/where precisely? We can avoid the issue in basebackup.c by taking
ControlFileLock in the right moment - but that doesn't address
pg_start/stop_backup based backups. Hence the patch in the referenced thread
moving to replacing the control file by atomic-rename if there are base
backups ongoing.
What had better be atomic is the write to disk.
That is still true to my knowledge.
Systems that can't manage POSIX semantics for concurrent reads and writes
are annoying, but not fatal ...
I think part of the issue is that people don't agree on what POSIX says
about a read that's concurrent with a write... See e.g.
https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic
Greetings,
Andres Freund
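A rough sketch of the write-temp-then-rename replacement Andres refers to
above (plain POSIX, purely illustrative; not the patch from the referenced
thread): a reader opening the final path sees either the old file or the new
one, never a torn mix, because rename() swaps the name atomically.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
replace_control_file_atomically(const char *globaldir,
								const void *buf, size_t len)
{
	char		tmppath[1024];
	char		finalpath[1024];
	int			fd;

	snprintf(tmppath, sizeof(tmppath), "%s/pg_control.tmp", globaldir);
	snprintf(finalpath, sizeof(finalpath), "%s/pg_control", globaldir);

	/* Write and flush the complete new contents under a temporary name. */
	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	if (close(fd) != 0)
		return -1;

	/* Atomically replace the old file with the fully written new one. */
	if (rename(tmppath, finalpath) != 0)
		return -1;

	/* fsync the containing directory so the rename survives a crash. */
	fd = open(globaldir, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}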
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
Phngh... this is surprisingly difficult to fix.
Things that don't work: We "just" need to acquire ControlFileLock
while reading the file or examining the object in shared memory, or
get a copy of it, passed through the EXEC_BACKEND BackendParameters
that was acquired while holding the lock, but the current location of
this code in child startup is too early to use LWLocks, and the
postmaster can't acquire locks either so it can't even safely take a
copy to pass on. You could reorder startup so that we are allowed to
acquire LWLocks in children at that point, but then you'd need to
convince yourself that there is no danger of breaking some ordering
requirement in external preload libraries, and figure out what to do
about children that don't even attach to shared memory. Maybe that's
possible, but that doesn't sound like a good idea to back-patch.
First idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down). Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.
I dunno. Draft patch attached. Better plans welcome. This passes CI
on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts?
Attachments:
v1-0001-Fix-pg_control-corruption-in-EXEC_BACKEND-startup.patch
From 48c2de14bd9368b4708a99ecbb75452dc327e608 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 18 May 2024 13:41:09 +1200
Subject: [PATCH v1 1/3] Fix pg_control corruption in EXEC_BACKEND startup.
When backend processes were launched in EXEC_BACKEND builds, they would
run LocalProcessControlFile() to read in pg_control and extract several
important settings.
This happens too early to acquire ControlFileLock, and the postmaster is
also not allowed to acquire ControlFileLock, so it can't safely take a
copy to give to the child.
Instead, pass down the "proto-controlfile" that was read by the
postmaster in LocalProcessControlFile(). Introduce functions
ExportProtoControlFile() and ImportProtoControlFile() to allow that.
Subprocesses will extract information from that, and then later attach
to the current control file in shared memory.
Reported-by: Melanie Plageman <melanieplageman@gmail.com> per Windows CI failure
Discussion: https://postgr.es/m/CAAKRu_YNGwEYrorQYza_W8tU%2B%3DtoXRHG8HpyHC-KDbZqA_ZVSA%40mail.gmail.com
---
src/backend/access/transam/xlog.c | 46 +++++++++++++++++++++++--
src/backend/postmaster/launch_backend.c | 19 ++++++----
src/include/access/xlog.h | 5 +++
3 files changed, 62 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f2..b69a0d95af9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -568,6 +568,10 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
*/
static ControlFileData *ControlFile = NULL;
+#ifdef EXEC_BACKEND
+static ControlFileData *ProtoControlFile = NULL;
+#endif
+
/*
* Calculate the amount of space left on the page after 'endptr'. Beware
* multiple evaluation!
@@ -686,6 +690,7 @@ static bool PerformRecoveryXLogAction(void);
static void InitControlFile(uint64 sysidentifier);
static void WriteControlFile(void);
static void ReadControlFile(void);
+static void ScanControlFile(void);
static void UpdateControlFile(void);
static char *str_time(pg_time_t tnow);
@@ -4309,9 +4314,7 @@ WriteControlFile(void)
static void
ReadControlFile(void)
{
- pg_crc32c crc;
int fd;
- static char wal_segsz_str[20];
int r;
/*
@@ -4344,6 +4347,15 @@ ReadControlFile(void)
close(fd);
+ ScanControlFile();
+}
+
+static void
+ScanControlFile(void)
+{
+ static char wal_segsz_str[20];
+ pg_crc32c crc;
+
/*
* Check for expected pg_control format version. If this is wrong, the
* CRC check will likely fail because we'll be checking the wrong number
@@ -4815,8 +4827,33 @@ LocalProcessControlFile(bool reset)
Assert(reset || ControlFile == NULL);
ControlFile = palloc(sizeof(ControlFileData));
ReadControlFile();
+
+#ifdef EXEC_BACKEND
+ /* We need to be able to give this to subprocesses. */
+ ProtoControlFile = ControlFile;
+#endif
+}
+
+#ifdef EXEC_BACKEND
+void
+ExportProtoControlFile(ControlFileData *copy)
+{
+ *copy = *ProtoControlFile;
}
+/*
+ * Like LocalProcessControlFile(), but used early in EXEC_BACKEND children's
+ * startup. This receives the same file that the postmaster first read.
+ */
+void
+ImportProtoControlFile(const ControlFileData *copy)
+{
+ ControlFile = palloc(sizeof(ControlFileData));
+ *ControlFile = *copy;
+ ScanControlFile();
+}
+#endif
+
/*
* Get the wal_level from the control file. For a standby, this value should be
* considered as its active wal_level, because it may be different from what
@@ -4935,7 +4972,12 @@ XLOGShmemInit(void)
if (localControlFile)
{
memcpy(ControlFile, localControlFile, sizeof(ControlFileData));
+#ifdef EXEC_BACKEND
+ /* We still hold a reference to give to subprocesses. */
+ Assert(ProtoControlFile == localControlFile);
+#else
pfree(localControlFile);
+#endif
}
/*
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index bdfa238e4fe..40efe738ae1 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "access/xlog.h"
+#include "catalog/pg_control.h"
#include "common/file_utils.h"
#include "libpq/libpq-be.h"
#include "libpq/pqsignal.h"
@@ -135,6 +136,14 @@ typedef struct
char my_exec_path[MAXPGPATH];
char pkglib_path[MAXPGPATH];
+ /*
+ * A copy of the ControlFileData from early in Postmaster startup. We
+ * need to access its contents at a phase of initialization before we
+ * are allowed to acquire LWLocks, so we can't just use shared memory or
+ * read the file from disk.
+ */
+ ControlFileData proto_controlfile;
+
/*
* These are only used by backend processes, but are here because passing
* a socket needs some special handling on Windows. 'client_sock' is an
@@ -643,12 +652,6 @@ SubPostmasterMain(int argc, char *argv[])
*/
checkDataDir();
- /*
- * (re-)read control file, as it contains config. The postmaster will
- * already have read this, but this process doesn't know about that.
- */
- LocalProcessControlFile(false);
-
/*
* Reload any libraries that were preloaded by the postmaster. Since we
* exec'd this process, those libraries didn't come along with us; but we
@@ -746,6 +749,8 @@ save_backend_variables(BackendParameters *param, ClientSocket *client_sock,
param->MaxBackends = MaxBackends;
+ ExportProtoControlFile(&param->proto_controlfile);
+
#ifdef WIN32
param->PostmasterHandle = PostmasterHandle;
if (!write_duplicated_handle(&param->initial_signal_pipe,
@@ -1018,6 +1023,8 @@ restore_backend_variables(BackendParameters *param)
strlcpy(pkglib_path, param->pkglib_path, MAXPGPATH);
+ ImportProtoControlFile(&param->proto_controlfile);
+
/*
* We need to restore fd.c's counts of externally-opened FDs; to avoid
* confusion, be sure to do this after restoring max_safe_fds. (Note:
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943f..853f3527ef1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -195,6 +195,7 @@ typedef enum WALAvailability
struct XLogRecData;
struct XLogReaderState;
+struct ControlFileData;
extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
XLogRecPtr fpw_lsn,
@@ -234,6 +235,10 @@ extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void InitializeWalConsistencyChecking(void);
extern void LocalProcessControlFile(bool reset);
+#ifdef EXEC_BACKEND
+extern void ExportProtoControlFile(struct ControlFileData *copy);
+extern void ImportProtoControlFile(const struct ControlFileData *copy);
+#endif
extern WalLevel GetActiveWalLevelOnStandby(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
--
2.39.2
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
First idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down). Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.
I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster. It might prevent future bugs that would
have been specific to EXEC_BACKEND.
I dunno. Draft patch attached. Better plans welcome. This passes CI
on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts?
Looks reasonable. I didn't check over every detail, given the draft status.
On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah@leadboat.com> wrote:
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
First idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down). Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.
I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster. It might prevent future bugs that would
have been specific to EXEC_BACKEND.
Thanks for looking! Yeah, that is a good way to put it.
The only other idea I can think of is that the Postmaster could take
all of the things that LocalProcessControlFile() wants to extract from
the file, and transfer them via that struct used for EXEC_BACKEND as
individual variables, instead of this new proto-controlfile copy. I
think it would be a bigger change with no obvious-to-me additional
benefit, so I didn't try it.
I dunno. Draft patch attached. Better plans welcome. This passes CI
on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts?
Looks reasonable. I didn't check over every detail, given the draft status.
I'm going to upgrade this to a proposal:
https://commitfest.postgresql.org/49/5124/
I wonder how often this happens in the wild.
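For reference, the alternative Thomas describes above (transferring the
individual settings through the EXEC_BACKEND parameter struct rather than a
whole proto control file) might look roughly like this; the struct and field
names here are invented for illustration and are not the real
BackendParameters layout.

#include "postgres.h"

#include "catalog/pg_control.h"

typedef struct
{
	/* ... existing EXEC_BACKEND parameters ... */
	int			wal_segment_size;		/* from ControlFile->xlog_seg_size */
	uint32		data_checksum_version;	/* fixed while the cluster is up */
} BackendParametersSketch;

/* Postmaster side: capture the values once, before launching children. */
static void
save_control_settings(BackendParametersSketch *param,
					  const ControlFileData *proto)
{
	param->wal_segment_size = (int) proto->xlog_seg_size;
	param->data_checksum_version = proto->data_checksum_version;
}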
Hello Thomas,
15.07.2024 06:44, Thomas Munro wrote:
I'm going to upgrade this to a proposal:
https://commitfest.postgresql.org/49/5124/
I wonder how often this happens in the wild.
Please look at a recent failure [1], produced by buildfarm animal
culicidae, which tests EXEC_BACKEND. I guess it's caused by the issue
discussed.
Maybe it would make sense to construct a reliable reproducer for the
issue (I could not find a ready-to-use recipe in this thread)...
What do you think?
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-07-24%2004%3A08%3A23
Best regards,
Alexander
On Mon, Jul 15, 2024 at 03:44:48PM +1200, Thomas Munro wrote:
On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah@leadboat.com> wrote:
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
First idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down). Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.
I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster. It might prevent future bugs that would
have been specific to EXEC_BACKEND.
Thanks for looking! Yeah, that is a good way to put it.
Oops, the way I put it turned out to be false. Postmaster has ControlFile
pointing to shared memory before forking backends, so !EXEC_BACKEND children
are born that way. In the postmaster, ControlFile->checkPointCopy.redo does
change after each checkpoint.
The only other idea I can think of is that the Postmaster could take
all of the things that LocalProcessControlFile() wants to extract from
the file, and transfer them via that struct used for EXEC_BACKEND as
individual variables, instead of this new proto-controlfile copy. I
think it would be a bigger change with no obvious-to-me additional
benefit, so I didn't try it.
Yeah, that would be more future-proof but a bigger change. One could argue
for a yet-larger refactor so even the !EXEC_BACKEND case doesn't read those
fields from ControlFile memory. Then we could get rid of ControlFile ever
being set to something other than NULL or a shmem pointer. ControlFileData's
mix of initdb-time fields, postmaster-start-time fields, and changes-anytime
fields is inconvenient here.
The unknown is the value of that future proofing. Much EXEC_BACKEND early
startup code is shared with postmaster startup, which can assume it's the only
process. I can't rule out a future bug where that shared code does a read
that's harmless in postmaster startup but harmful when an EXEC_BACKEND child
runs the same read. For a changes-anytime field, the code would already be
subtly buggy in EXEC_BACKEND today, since it would be reading without an
LWLock. For a postmaster-start-time field, things should be okay so long as
we capture the proto ControlFileData after the last change to such fields.
That invariant is not trivial to achieve, but it's not gravely hard either.
A possible middle option would be to use the proto control file, but
explicitly set its changes-anytime fields to bogus values.
What's your preference? I don't think any of these would be bad decisions.
It could be clearer after enumerating how many ControlFile fields get used
this early.
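To make that middle option concrete, a rough sketch might look like the
following; the field list is illustrative rather than exhaustive, and this is
not code from the thread.

#include "postgres.h"

#include <string.h>

#include "access/xlogdefs.h"
#include "catalog/pg_control.h"

/*
 * Clobber the changes-anytime fields of the proto control file handed to
 * EXEC_BACKEND children, so an accidental early read of them is likely to
 * fail loudly rather than silently use stale data.
 */
static void
poison_changes_anytime_fields(ControlFileData *proto)
{
	memset(&proto->checkPointCopy, 0x7f, sizeof(proto->checkPointCopy));
	proto->checkPoint = InvalidXLogRecPtr;
	proto->minRecoveryPoint = InvalidXLogRecPtr;
	proto->minRecoveryPointTLI = 0;
	proto->time = 0;
}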
On Wed, 31 Jul 2024 at 05:25, Noah Misch <noah@leadboat.com> wrote:
On Mon, Jul 15, 2024 at 03:44:48PM +1200, Thomas Munro wrote:
On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah@leadboat.com> wrote:
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds. Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).
First idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down). Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.
I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster. It might prevent future bugs that would
have been specific to EXEC_BACKEND.
Thanks for looking! Yeah, that is a good way to put it.
Oops, the way I put it turned out to be false. Postmaster has ControlFile
pointing to shared memory before forking backends, so !EXEC_BACKEND children
are born that way. In the postmaster, ControlFile->checkPointCopy.redo does
change after each checkpoint.
The only other idea I can think of is that the Postmaster could take
all of the things that LocalProcessControlFile() wants to extract from
the file, and transfer them via that struct used for EXEC_BACKEND as
individual variables, instead of this new proto-controlfile copy. I
think it would be a bigger change with no obvious-to-me additional
benefit, so I didn't try it.
Yeah, that would be more future-proof but a bigger change. One could argue
for a yet-larger refactor so even the !EXEC_BACKEND case doesn't read those
fields from ControlFile memory. Then we could get rid of ControlFile ever
being set to something other than NULL or a shmem pointer. ControlFileData's
mix of initdb-time fields, postmaster-start-time fields, and changes-anytime
fields is inconvenient here.
The unknown is the value of that future proofing. Much EXEC_BACKEND early
startup code is shared with postmaster startup, which can assume it's the only
process. I can't rule out a future bug where that shared code does a read
that's harmless in postmaster startup but harmful when an EXEC_BACKEND child
runs the same read. For a changes-anytime field, the code would already be
subtly buggy in EXEC_BACKEND today, since it would be reading without an
LWLock. For a postmaster-start-time field, things should be okay so long as
we capture the proto ControlFileData after the last change to such fields.
That invariant is not trivial to achieve, but it's not gravely hard either.
A possible middle option would be to use the proto control file, but
explicitly set its changes-anytime fields to bogus values.
What's your preference?
@Thomas Munro, others - If anyone has suggestions on which approach
might work best, please share them to help move this discussion
forward.
Regards,
Vignesh