Checkpointer sync queue fills up / loops around pg_usleep() are bad
Hi,
In two recent investigations in occasional test failures
(019_replslot_limit.pl failures, AIO rebase) the problems are somehow tied to
checkpointer.
I don't yet know whether it's actually causally related to precisely those failures, but
when running e.g. 027_stream_regress.pl, I see phases in which many backends
are looping in RegisterSyncRequest() repeatedly, each time sleeping with
pg_usleep(10000L).
Without adding instrumentation this is completely invisible at any log
level. There's no log messages, there's no wait events, nothing.
ISTM, we should not have any loops around pg_usleep(). And shorter term, we
shouldn't have any loops around pg_usleep() that don't emit log messages / set
wait events. Therefore I propose that we "prohibit" such loops without at
least a DEBUG2 elog() or so. It's just too hard to debug.
The reason for the sync queue filling up in 027_stream_regress.pl is actually
fairly simple:
1) The test runs with shared_buffers = 1MB, leading to a small sync queue of
128 entries.
2) CheckpointWriteDelay() does pg_usleep(100000L)
3) ForwardSyncRequest() wakes up the checkpointer using SetLatch() if the sync
queue is more than half full.
But at least on linux and freebsd that doesn't actually interrupt pg_usleep()
anymore (due to using signalfd / kqueue rather than a signal handler). And on
all platforms the signal might arrive just before the pg_usleep() rather than
during, also not causing usleep to be interrupted.
If I shorten the sleep in CheckpointWriteDelay() the problem goes away. This
actually reduces the time for a single run of 027_stream_regress.pl on my
workstation noticeably. With default sleep time it's ~32s, with shortened time
it's ~27s.
I suspect we need to do something about this concrete problem for 14 and
master, because it's certainly worse than before on linux / freebsd.
I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?
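Concretely, the conversion being proposed would look roughly like this in CheckpointWriteDelay() (a sketch only, not a tested patch; the wait event enum name is illustrative):

```c
/* before */
pg_usleep(100000L);

/* after (sketch): a latch wait, so that SetLatch() from
 * ForwardSyncRequest() actually interrupts the sleep, and the wait
 * becomes visible in pg_stat_activity as a wait event */
WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
          100 /* ms */, WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
ResetLatch(MyLatch);
```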
Greetings,
Andres Freund
On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?
We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD. In recent memory, I
think that only some of the extensions published by PostgresPro rely
on the enums in this area.
--
Michael
On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD.
+1
In recent memory, I
think that only some of the extensions published by PostgresPro rely
on the enums in this area.
Indeed, I only know of pg_wait_sampling which uses it. Note that it relies on
pgstat_get_wait_event* functions, so it should only return "???" / "unknown
wait event" until you recompile it for a newer minor version, rather than
report errors or crash. All other extensions I know of simply use whatever
pg_stat_activity returns, so no impact.
On Sun, Feb 27, 2022 at 10:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD.

+1
+1
Sleeps like these are also really bad for ProcSignalBarrier, which I
was expecting to be the impetus for fixing any remaining loops like
this.
With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
on my FreeBSD workstation!
It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.
The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:
- pg_usleep(10000L);
+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
WAIT_EVENT_SYNC_REQUEST);
But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?
Attachments:
0001-Wake-up-for-latches-in-CheckpointWriteDelay.patch (text/x-patch, +12/-2)
Hi,
On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
on my FreeBSD workstation!
That's impressive - wouldn't have guessed it to make that much of a difference. I assume that running the tests on freebsd for an older pg with a similar s_b & max_wal_size doesn't benefit as much? I wonder how much windows will improve.
It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.
In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.
I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...
The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:

- pg_usleep(10000L);
+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?
Looks worth improving, but yes, I'd not do it in the back branches.
I do think it's worth giving that sleep a proper wait event though, even in the back branches.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Feb 28, 2022 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:
On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.
Yeah.
I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...
I might look into this.
The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:

- pg_usleep(10000L);
+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches.
0003 is a first attempt at that, for master only (on top of 0002 which
is the minimal fix). This shaves another second off
027_stream_regress.pl on my workstation. The main thing I realised is
that I needed to hold interrupts while waiting, which seems like it
should go away with 'tombstone' files as discussed in other threads.
That's not a new problem in this patch, it just looks more offensive
to the eye when you spell it out, instead of hiding it with an
unreported sleep/poll loop...
I do think it's worth giving that sleep a proper wait event though, even in the back branches.
I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.
Attachments:
v2-0001-Wake-up-for-latches-in-CheckpointWriteDelay.patch (text/x-patch, +15/-2)
v2-0002-Fix-waiting-in-RegisterSyncRequest.patch (text/x-patch, +12/-2)
v2-0003-Use-condition-variable-to-wait-when-sync-request-.patch (text/x-patch, +62/-41)
Hi,
On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
I do think it's worth giving that sleep a proper wait event though, even in the back branches.
I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.
No strong opinion on how far back to backpatch. It'd be nice to have a proper
wait event everywhere, but especially < 12 it looks different enough to be
some effort.
From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().
LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)
From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().
LGTM.
From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
is full.

Previously, in the (hopefully) rare case that we need to wait for the
checkpointer to create space in the sync request queue, we'd enter a
sleep/retry loop. Instead, create a condition variable so the
checkpointer can wake us up whenever there is a transition from 'full'
to 'not full'.
@@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
  * to perform its own fsync, which is far more expensive in practice. It
  * is theoretically possible a backend fsync might still be necessary, if
  * the queue is full and contains no duplicate entries. In that case, we
- * let the backend know by returning false.
+ * let the backend know by returning false, or if 'wait' is true, then we
+ * wait for space to become available.
  */
 bool
-ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
 {
 	CheckpointerRequest *request;
 	bool		too_full;
@@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	 * backend will have to perform its own fsync request. But before forcing
 	 * that to happen, we can try to compact the request queue.
 	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
+		!CompactCheckpointerRequestQueue() &&
+		!wait)
Bit confused about the addition of the wait parameter / removal of the
CheckpointerShmem->checkpointer_pid check. What's that about?
+	/*
+	 * If we still don't have enough space and the caller asked us to wait,
+	 * wait for the checkpointer to advertise that there is space.
+	 */
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
+	{
+		ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
+		while (CheckpointerShmem->num_requests >=
+			   CheckpointerShmem->max_requests)
+		{
+			LWLockRelease(CheckpointerCommLock);
+			ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
+								   WAIT_EVENT_FORWARD_SYNC_REQUEST);
+			LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+		}
+		ConditionVariableCancelSleep();
+	}
Could there be a problem with a lot of backends trying to acquire
CheckpointerCommLock in exclusive mode? I'm inclined to think it's rare enough
to not worry.
Greetings,
Andres Freund
On Wed, Mar 2, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)
Thanks for the review. Pushed to master and 14, with the wait event
moved to the end of the enum for the back-patch.
From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.
Pushed as far back as 12. It could be done for 10 & 11, but indeed
the code starts getting quite different back there, and since there
are no field reports, I think that's OK for now.
A simple repro, for the record: run installcheck with
shared_buffers=256kB, and then partway through, kill -STOP
$checkpointer to simulate being stalled on IO for a while. Backends
will soon start waiting for the checkpointer to drain the queue while
dropping relations. This state was invisible to pg_stat_activity, and
hangs forever if you kill the postmaster and CONT the checkpointer.
From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
is full.
[review]
I'll come back to 0003 (condition variable-based improvement) a bit later.