Race between KeepFileRestoredFromArchive() and restartpoint

Started by Noah Misch · about 5 years ago · 15 messages · pgsql-hackers
#1Noah Misch
noah@leadboat.com

A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as
seen once on the buildfarm [1]. The attached patch adds a test case; it
applies atop the "stop events" patch [2]. We have two systems for adding
long-term pg_wal directory entries. KeepFileRestoredFromArchive() adds them
during archive recovery, while InstallXLogFileSegment() does so at all times.
Unfortunately, InstallXLogFileSegment() happens throughout archive recovery,
via the checkpointer recycling segments and calling PreallocXlogFiles().
Multiple processes can run InstallXLogFileSegment(), which uses
ControlFileLock to represent the authority to modify the directory listing of
pg_wal. KeepFileRestoredFromArchive() just assumes it controls pg_wal.
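
To make the hazard concrete, here is a minimal model of the two code paths
(purely illustrative Python; names and behavior are heavily simplified from
PostgreSQL's transam code, where only InstallXLogFileSegment() takes
ControlFileLock):

```python
import threading

# Model pg_wal as a set of file names, with ControlFileLock as a mutex.
pg_wal = {"000000010000000000000003"}  # a segment already installed
control_file_lock = threading.Lock()

def install_xlog_file_segment(name):
    """Modeled on InstallXLogFileSegment(): takes ControlFileLock before
    modifying the directory listing, so concurrent installers are safe
    with respect to each other."""
    with control_file_lock:
        if name in pg_wal:
            return False  # the real code's find_free logic picks another slot
        pg_wal.add(name)
        return True

def keep_file_restored_from_archive(name):
    """Modeled on KeepFileRestoredFromArchive(): unlinks any existing entry
    and renames the restored file into place WITHOUT taking the lock --
    it simply assumes it controls pg_wal.  Between the discard and the add,
    install_xlog_file_segment() can observe the name as free and install a
    recycled segment there: that window is the race."""
    pg_wal.discard(name)   # unlink the old entry
    pg_wal.add(name)       # rename RECOVERYXLOG into place

keep_file_restored_from_archive("000000010000000000000003")
```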

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
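
As a sketch of the proposed regime flag (illustrative Python; the flag name
and helper functions here are hypothetical, not taken from the actual patch):

```python
# Exactly one regime may add long-term pg_wal entries at a time.
# The real flag would live in XLogCtl; this name is hypothetical.
installer_regime_active = False  # False until archive recovery ends

def xlog_file_init_allowed():
    # Under the proposal, XLogFileInit() fails during archive recovery.
    return installer_regime_active

def checkpointer_handle_old_segment(segment, pg_wal):
    """During archive recovery, unlink old segments (like wal_recycle=off);
    otherwise recycle them into future segment slots (simplified)."""
    if not installer_regime_active:
        pg_wal.discard(segment)          # just unlink
        return "unlinked"
    pg_wal.discard(segment)
    pg_wal.add("future-" + segment)      # recycle under a new name
    return "recycled"
```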

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
InstallXLogFileSegment(). Add ControlFileLock acquisition to
KeepFileRestoredFromArchive(). This provides adequate mutual exclusion, but
XLogFileInit() could return a file descriptor for an unlinked file. That's
fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments. (Just delete
obsolete WAL.) By denying those benefits, this presumably makes streaming
recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
then copy bytes. This is simple, but it multiplies I/O. That might be
negligible on account of caching, or it might not be. A variant, incurring
extra fsyncs, would be to use durable_rename() to replace the segment we get
from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking. This
avoids checkpoint failure, but a race could trigger noise from the LOG
message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative? It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2]: /messages/by-id/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com

Attachments:

repro-KeepFileRestoredFromArchive-v0.patch (text/plain; +159 -2)
#2Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#1)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

Attachments:

XLogFileInit1-use_lock-v1.patch (text/plain; +16 -34)
XLogFileInit2-redefine-use_existent-v1.patch (text/plain; +6 -7)
XLogFileInit3-write-only-use_existent-v1.patch (text/plain; +28 -40)
XLogFileInit4-PreallocXlogFiles-v1.patch (text/plain; +58 -27)
XLogFileInit5-lock-v1.patch (text/plain; +57 -8)
#3David Steele
david@pgmasters.net
In reply to: Noah Misch (#2)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1]: /messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com

#4Noah Misch
noah@leadboat.com
In reply to: David Steele (#3)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.


[1] /messages/by-id/CAKw-smBhLOGtRJTC5c=qKTPz8gz5+WPoVAXrHB6mY-1U4_N7-w@mail.gmail.com

#5David Steele
david@pgmasters.net
In reply to: Noah Misch (#4)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people
from wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

Regards,
-David

[1]: /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

#6Noah Misch
noah@leadboat.com
In reply to: David Steele (#5)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?
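
For illustration, the arithmetic behind that theory can be sketched
numerically (hypothetical numbers; the premise is revisited later in the
thread). If each loop iteration independently fails with probability 1/N
and a failure aborts the remaining recycles, the expected number of
segments recycled per restartpoint follows a geometric survival sum,
roughly N - 1, while more than N files arrive between restartpoints:

```python
def expected_recycled(n, max_iters=10_000):
    """Expected recycles per restartpoint if each iteration fails with
    probability 1/n and a failure aborts the rest of the loop:
    E = sum_{k>=1} (1 - 1/n)^k, approximately n - 1."""
    p_fail = 1.0 / n
    e, p_alive = 0.0, 1.0
    for _ in range(max_iters):
        p_alive *= (1.0 - p_fail)   # loop survives this iteration
        e += p_alive                # one more segment recycled
    return e

n = 50                               # hypothetical segments per cycle
arrivals = n + 5                     # recovery restores >N files
growth_per_cycle = arrivals - expected_recycled(n)  # positive => unbounded growth
```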

#7David Steele
david@pgmasters.net
In reply to: Noah Misch (#6)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 8/2/22 10:37, Noah Misch wrote:

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

Regards,
-David

#8Don Seiler
don@seiler.us
In reply to: David Steele (#7)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

During my latest incident, there was only one occurrence:

could not link file "pg_wal/xlogtemp.18799" to
"pg_wal/000000010000D45300000010": File exists

WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

--
Don Seiler
www.seiler.us

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Don Seiler (#8)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the
issue in the other thread the skipping of many checkpoint records
within checkpoint_timeout? I remember that I proposed a GUC
variable to disable that checkpoint skipping. As another measure for
that issue, we could force replaying a checkpoint if max_wal_size is
already filled up, or is known to be filled in the next checkpoint cycle.
If this is correct, this patch is irrelevant to the issue.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

During my latest incident, there was only one occurrence:

could not link file "pg_wal/xlogtemp.18799" to
"pg_wal/000000010000D45300000010": File exists

(I noticed that the patch in the other thread is broken :()

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is correct,
this is a live issue on the master branch.)

WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Noah Misch
noah@leadboat.com
In reply to: Kyotaro Horiguchi (#9)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

could not link file "pg_wal/xlogtemp.18799" to
"pg_wal/000000010000D45300000010": File exists

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is correct,
this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?
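
The find_free=true behavior described here can be sketched as follows
(simplified, illustrative Python with hypothetical names; the real logic
is C in InstallXLogFileSegment(), which also bounds how far it will
advance past the caller's target):

```python
import threading

control_file_lock = threading.Lock()  # stands in for ControlFileLock

def install_xlog_file_segment(pg_wal, tmppath, segno, find_free, max_advance=32):
    """While holding the lock, optionally advance past occupied segment
    slots, so a caller-chosen target that has since been taken is not an
    error.  Returns the slot actually used, or None on failure."""
    with control_file_lock:
        while f"{segno:08X}" in pg_wal:
            if not find_free:
                return None            # caller insisted on this exact slot
            if max_advance == 0:
                return None            # no free slot within reach
            segno += 1
            max_advance -= 1
        pg_wal.discard(tmppath)        # "rename" tmppath into the chosen slot
        pg_wal.add(f"{segno:08X}")
        return segno

pg_wal = {"00000005", "00000006"}      # slots 5 and 6 already occupied
chosen = install_xlog_file_segment(pg_wal, "xlogtemp.123", 5, find_free=True)
```

With find_free=true, a target of slot 5 quietly lands in slot 7 instead of
erroring, which is why a stale caller-chosen target is tolerated.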

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Noah Misch (#10)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch <noah@leadboat.com> wrote in

On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:

At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in

could not link file "pg_wal/xlogtemp.18799" to
"pg_wal/000000010000D45300000010": File exists

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is correct,
this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?

Ug.. No. I can't. I was confused by something. Sorry.

PreallocXlogFiles() and checkpointer are mutually excluded by the same
lock, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#6)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

Attachments:

XLogFileInit6-hint-v1.patch (text/plain; +12 -3)
#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Noah Misch (#12)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

+1

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14David Steele
david@pgmasters.net
In reply to: Kyotaro Horiguchi (#13)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

+1

Looks good to me as well.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal
growing without bound. Even if we are not sure what is causing it, how
confident are we that the patches applied to v15 would fix it?

Regards,
-David

#15Noah Misch
noah@leadboat.com
In reply to: David Steele (#14)
Re: Race between KeepFileRestoredFromArchive() and restartpoint

On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

+1

Looks good to me as well.

Thanks for reviewing.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.

[1] /messages/by-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx=4W7KxNB4zzt++qFg@mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

I agree to this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
without bound. Even if we are not sure what is causing it, how confident are
we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating. The next things I'd want
to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.