InstallXLogFileSegment() vs concurrent WAL flush

Started by Thomas Munroalmost 2 years ago5 messages

thomas.munro@gmail.com

almost 2 years ago

2 attachment(s)

Hi,

New WAL space is created by renaming a file into place. Either a
newly created file with a temporary name or, ideally, a recyclable old
file with a name derived from an old LSN. I think there is a data
loss window between rename() and fsync(parent_directory). A
concurrent backend might open(new_name), write(), fdatasync(), and
then we might lose power before the rename hits the disk. The data
itself would survive the crash, but recovery wouldn't be able to find
and replay it. That might break the log-before-data rule or forget a
transaction that has been reported as committed to a client.

Actual breakage would presumably require really bad luck, and I
haven't seen this happen or anything, it just occurred to me while
reading code, and I can't see any existing defences.

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish. It's a little
pessimistic to do that unconditionally, though, as then you have to
wait even for rename operations for segment files later than the one
you're opening, so I thought about how to skip waiting in that case --
see 0002. I'm not sure if it's worth worrying about or not.

Attachments:

0001-Fix-InstallXLogFileSegment-concurrency-bug.patchapplication/octet-stream; name=0001-Fix-InstallXLogFileSegment-concurrency-bug.patchDownload

From a70c79191567d51fefe71b07f7ef4afabaf0eb84 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 7 Jan 2024 12:40:57 +1300
Subject: [PATCH 1/2] Fix InstallXLogFileSegment() concurrency bug.

If the checkpointer is recycling a WAL segment file or another backend
is installing a new file, we mustn't allow other processes to write data
into the file before its name is durable.  Otherwise the data could
become unreachable in recovery after a power loss.
---
 src/backend/access/transam/xlog.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..d76cfbb66e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3062,7 +3062,20 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 					 errmsg("could not open file \"%s\": %m", path)));
 	}
 	else
+	{
+		/*
+		 * The file is there, but it is possible that InstallXLogFileSegment()
+		 * has recently renamed it and not yet made the new name durable.  We
+		 * don't want to be able to flush data into a file whose name might
+		 * not survive power loss, since it would become unreachable in
+		 * recovery.  Since InstallXlogFileSegment() holds ControlFileLock,
+		 * acquiring it here is enough to wait for any durable_rename() call
+		 * that might have started before we opened the file.
+		 */
+		LWLockAcquire(ControlFileLock, LW_SHARED);
+		LWLockRelease(ControlFileLock);
 		return fd;
+	}
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that
-- 
2.39.3 (Apple Git-145)

0002-Track-end-of-installed-WAL-space-in-shared-memory.patchapplication/octet-stream; name=0002-Track-end-of-installed-WAL-space-in-shared-memory.patchDownload

From 7459f6521da9a3d8748b932921305db55ed1fc0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 7 Jan 2024 12:50:46 +1300
Subject: [PATCH 2/2] Track end of installed WAL space in shared memory.

We can often skip taking a lock when opening new WAL segment files if we
track the most recently known installed file.
---
 src/backend/access/transam/xlog.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d76cfbb66e..8e5900e722 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -470,6 +470,8 @@ typedef struct XLogCtlData
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
 
+	pg_atomic_uint64 last_known_installed_segno;
+
 	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
@@ -3071,9 +3073,17 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * recovery.  Since InstallXlogFileSegment() holds ControlFileLock,
 		 * acquiring it here is enough to wait for any durable_rename() call
 		 * that might have started before we opened the file.
+		 *
+		 * We can skip that if we can already see that the WAL space we need
+		 * is fully synchronized.  We may see a slightly out of date value
+		 * since we haven't acquired the lock yet, but that's OK, it just
+		 * means we might take the lock when we don't need to.
 		 */
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		LWLockRelease(ControlFileLock);
+		if (pg_atomic_read_u64(&XLogCtl->last_known_installed_segno) < logsegno)
+		{
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			LWLockRelease(ControlFileLock);
+		}
 		return fd;
 	}
 
@@ -3446,6 +3456,8 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		return false;
 	}
 
+	pg_atomic_write_u64(&XLogCtl->last_known_installed_segno, *segno);
+
 	LWLockRelease(ControlFileLock);
 
 	return true;
@@ -4827,6 +4839,8 @@ XLOGShmemInit(void)
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	SpinLockInit(&XLogCtl->ulsn_lck);
+
+	pg_atomic_init_u64(&XLogCtl->last_known_installed_segno, 0);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

Yugo NAGATA

nagata@sraoss.co.jp

almost 2 years ago

In reply to: Thomas Munro (#1)

1 attachment(s)

Re: InstallXLogFileSegment() vs concurrent WAL flush

On Fri, 2 Feb 2024 11:18:18 +0100
Thomas Munro <thomas.munro@gmail.com> wrote:

Hi,

New WAL space is created by renaming a file into place. Either a
newly created file with a temporary name or, ideally, a recyclable old
file with a name derived from an old LSN. I think there is a data
loss window between rename() and fsync(parent_directory). A
concurrent backend might open(new_name), write(), fdatasync(), and
then we might lose power before the rename hits the disk. The data
itself would survive the crash, but recovery wouldn't be able to find
and replay it. That might break the log-before-data rule or forget a
transaction that has been reported as committed to a client.

Actual breakage would presumably require really bad luck, and I
haven't seen this happen or anything, it just occurred to me while
reading code, and I can't see any existing defences.

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish. It's a little

Or, can we make sure the rename is durable by calling fsync before
returning the fd, as a patch attached here?

Regards,
Yugo Nagata

pessimistic to do that unconditionally, though, as then you have to
wait even for rename operations for segment files later than the one
you're opening, so I thought about how to skip waiting in that case --
see 0002. I'm not sure if it's worth worrying about or not.

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

fix_InstallXLogFileSegment_in_another_way.patchtext/x-diff; name=fix_InstallXLogFileSegment_in_another_way.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..862109ca3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3062,7 +3062,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 					 errmsg("could not open file \"%s\": %m", path)));
 	}
 	else
+	{
+		fsync_fname_ext(path, false, false, ERROR);
+		fsync_parent_path(path, ERROR);
 		return fd;
+	}
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Yugo NAGATA (#2)

Re: InstallXLogFileSegment() vs concurrent WAL flush

On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:

On Fri, 2 Feb 2024 11:18:18 +0100
Thomas Munro <thomas.munro@gmail.com> wrote:

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish. It's a little

Or, can we make sure the rename is durable by calling fsync before
returning the fd, as a patch attached here?

Right, yeah, that works too. I'm not sure which way is better.

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#3)

Re: InstallXLogFileSegment() vs concurrent WAL flush

At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro <thomas.munro@gmail.com> wrote in

On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:

On Fri, 2 Feb 2024 11:18:18 +0100
Thomas Munro <thomas.munro@gmail.com> wrote:

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish. It's a little

Or, can we make sure the rename is durable by calling fsync before
returning the fd, as a patch attached here?

Right, yeah, that works too. I'm not sure which way is better.

I'm not sure I like issuing spurious syncs unconditionally. Therefore,
I prefer Thomas' approach in that regard. 0002 would be beneficial,
considering the case of a very large max_wal_size, and the code seems
to be the minimal required. I don't think it matters that the lock
attempts occur uselessly until the first segment installation. That
being said, we could avoid it by initializing
last_known_installed_segno properly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Soumyadeep Chakraborty

soumyadeep2007@gmail.com

about 1 year ago

In reply to: Kyotaro Horiguchi (#4)

1 attachment(s)

Re: InstallXLogFileSegment() vs concurrent WAL flush

On Mon, Feb 5, 2024 at 11:58 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro <thomas.munro@gmail.com> wrote in

On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:

On Fri, 2 Feb 2024 11:18:18 +0100
Thomas Munro <thomas.munro@gmail.com> wrote:

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish. It's a little

Or, can we make sure the rename is durable by calling fsync before
returning the fd, as a patch attached here?

Right, yeah, that works too. I'm not sure which way is better.

I'm not sure I like issuing spurious syncs unconditionally. Therefore,
I prefer Thomas' approach in that regard. 0002 would be beneficial,
considering the case of a very large max_wal_size, and the code seems
to be the minimal required. I don't think it matters that the lock
attempts occur uselessly until the first segment installation. That
being said, we could avoid it by initializing
last_known_installed_segno properly.

I agree with avoiding unconditional fsyncs as well. 0002 seems safer in terms
of avoiding contention. I attached a patch initing the
last_known_installed_segno to the segno of the starting checkpoint. I added
some more commentary too.

I added a comment that I'm not 100% sure about for durable_rename(). It was my
attempt at making other users of durable_rename() aware of a generalization of
the issue we are trying to tackle here.

Are there other uses of durable_rename() that can suffer a similar issue? For
instance, relmapper seems safe because we rely on RelationMappingLock. However,
does the wal summarizer?

Generally speaking, it's bad that durable_rename() has this window that makes
it less atomic than it seems. Ideally, we could have held some generic lock to
prevent that perhaps. Like have some file API within fd.c to grab a lock to
open a rename-able file, with the same LWLock being held during a
durable_rename()? Guess it would invite more contention though, and surrounding
code paths are already holding other LWLocks, which we may not be able to get
rid of.

Regards,
Soumyadeep (Broadcom)

Attachments:

v3-0001-InstallXLogFileSegment-vs-concurrent-WAL-flush.patchtext/x-patch; charset=US-ASCII; name=v3-0001-InstallXLogFileSegment-vs-concurrent-WAL-flush.patchDownload

From 7a974a0de9838968d0b41f42cd3922736e6a2918 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Thu, 31 Oct 2024 01:00:05 -0700
Subject: [PATCH v3 1/1] InstallXLogFileSegment() vs concurrent WAL flush

If the checkpointer is recycling a WAL segment file or another backend
is installing a new file, we mustn't allow other processes to write data
into the file before its name is durable.  Otherwise the data could
become unreachable in recovery after a power loss.
---
 src/backend/access/transam/xlog.c | 41 +++++++++++++++++++++++++++++++
 src/backend/storage/file/fd.c     |  4 +++
 2 files changed, 45 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a2801e482..a28a9a5dbe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -459,6 +459,11 @@ typedef struct XLogCtlData
 
 	/* Fake LSN counter, for unlogged relations. */
 	pg_atomic_uint64 unloggedLSN;
+	/*
+	 * Approximation of the last WAL segment number that is known to have been
+	 * installed by InstallXLogFileSegment().
+	 */
+	pg_atomic_uint64 last_known_installed_segno;
 
 	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
@@ -3226,7 +3231,28 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 					 errmsg("could not open file \"%s\": %m", path)));
 	}
 	else
+	{
+		/*
+		 * The file is there, but it is possible that InstallXLogFileSegment()
+		 * has recently renamed it and not yet made the new name durable.  We
+		 * don't want to be able to flush data into a file whose name might
+		 * not survive power loss, since it would become unreachable in
+		 * recovery.  Since InstallXlogFileSegment() holds ControlFileLock,
+		 * acquiring it here is enough to wait for any durable_rename() call
+		 * that might have started before we opened the file.
+		 *
+		 * We can skip that if we can already see that the WAL space we need
+		 * is fully synchronized.  We may see a slightly out of date value
+		 * since we haven't acquired the lock yet, but that's OK, it just
+		 * means we might take the lock when we don't need to.
+		 */
+		if (pg_atomic_read_u64(&XLogCtl->last_known_installed_segno) < logsegno)
+		{
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			LWLockRelease(ControlFileLock);
+		}
 		return fd;
+	}
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that
@@ -3561,6 +3587,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 
 	XLogFilePath(path, tli, *segno, wal_segment_size);
 
+	/*
+	 * Acquire and keep the ControlFileLock held *until* we have renamed the
+	 * target segment durably. See XLogFileInitInternal() for details as to why
+	 * it is dangerous otherwise.
+	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (!XLogCtl->InstallXLogFileSegmentActive)
 	{
@@ -3597,6 +3628,8 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		return false;
 	}
 
+	pg_atomic_write_u64(&XLogCtl->last_known_installed_segno, *segno);
+
 	LWLockRelease(ControlFileLock);
 
 	return true;
@@ -4913,6 +4946,7 @@ XLOGShmemInit(void)
 	char	   *allocptr;
 	int			i;
 	ControlFileData *localControlFile;
+	XLogSegNo 	lastKnownInstalledSegno = 0;
 
 #ifdef WAL_DEBUG
 
@@ -4960,6 +4994,12 @@ XLOGShmemInit(void)
 	{
 		memcpy(ControlFile, localControlFile, sizeof(ControlFileData));
 		pfree(localControlFile);
+		/*
+		 * A decent approximation for the last known installed WAL segment
+		 * number can be the segment in which the checkpoint record resides,
+		 * specially in cases where we have had a clean shutdown.
+		 */
+		XLByteToSeg(ControlFile->checkPoint, lastKnownInstalledSegno, wal_segment_size);
 	}
 
 	/*
@@ -5014,6 +5054,7 @@ XLOGShmemInit(void)
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+	pg_atomic_init_u64(&XLogCtl->last_known_installed_segno, lastKnownInstalledSegno);
 }
 
 /*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 42bf857e87..44e019a8db 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -772,6 +772,10 @@ fsync_fname(const char *fname, bool isdir)
  * might not be on the same filesystem. Therefore this routine does not
  * support renaming across directories.
  *
+ * Note that there is a window between the rename and the fsync(s). If "newfile"
+ * is opened, written to and then fdatasynced, and if there is a crash before
+ * the fsync(s) hits disk, the written data could be .
+ *
  * Log errors with the caller specified severity.
  *
  * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
-- 
2.43.0