delay starting WAL receiver
I discussed this a bit in a different thread [0]/messages/by-id/20221215224721.GA694065@nathanxps13, but I thought it deserved
its own thread.
After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that the recovery tests consistently take much longer. Upon further
inspection, it looks like a similar race condition to the one described in
e5d494d's commit message. With some added debug logs, I see that all of
the callers of MaybeStartWalReceiver() complete before SIGCHLD is
processed, so ServerLoop() waits for a minute before starting the WAL
receiver.
The attached patch fixes this by adjusting DetermineSleepTime() to limit
the sleep to at most 100ms when WalReceiverRequested is set, similar to how
the sleep is limited when background workers must be restarted.
[0]: /messages/by-id/20221215224721.GA694065@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-handle-race-condition-when-restarting-wal-receive.patchtext/x-diff; charset=us-asciiDownload
From 6f32238f119236322dfb16262a88c3a9c5141413 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Thu, 15 Dec 2022 14:11:43 -0800
Subject: [PATCH v1 1/1] handle race condition when restarting wal receivers
---
src/backend/postmaster/postmaster.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eac3450774..9d6f58e0b3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1601,9 +1601,9 @@ checkControlFile(void)
*
* In normal conditions we wait at most one minute, to ensure that the other
* background tasks handled by ServerLoop get done even when no requests are
- * arriving. However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced. Other exception
- * cases are as shown in the code.
+ * arriving. However, if there are background workers or a WAL receiver
+ * waiting to be started, we make sure they are quickly serviced. Other
+ * exception cases are as shown in the code.
*/
static void
DetermineSleepTime(struct timeval *timeout)
@@ -1611,11 +1611,12 @@ DetermineSleepTime(struct timeval *timeout)
TimestampTz next_wakeup = 0;
/*
- * Normal case: either there are no background workers at all, or we're in
- * a shutdown sequence (during which we ignore bgworkers altogether).
+ * Normal case: either there are no background workers and no WAL receiver
+ * at all, or we're in a shutdown sequence (during which we ignore
+ * bgworkers and the WAL receiver altogether).
*/
if (Shutdown > NoShutdown ||
- (!StartWorkerNeeded && !HaveCrashedWorker))
+ (!StartWorkerNeeded && !HaveCrashedWorker && !WalReceiverRequested))
{
if (AbortStartTime != 0)
{
@@ -1674,6 +1675,21 @@ DetermineSleepTime(struct timeval *timeout)
}
}
+ /*
+ * If WalReceiverRequested is set, we're probably waiting on a SIGCHLD to
+ * arrive to clear WalReceiverPID before starting the new WAL receiver. We
+ * don't expect that to take long, so limit the sleep to 100ms so that we
+ * start the new WAL receiver promptly.
+ */
+ if (WalReceiverRequested)
+ {
+ TimestampTz walrcv_wakeup;
+
+ walrcv_wakeup = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 100);
+ if (next_wakeup == 0 || walrcv_wakeup < next_wakeup)
+ next_wakeup = walrcv_wakeup;
+ }
+
if (next_wakeup != 0)
{
long secs;
--
2.25.1
On Wed, Jan 11, 2023 at 2:08 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I discussed this a bit in a different thread [0], but I thought it deserved
its own thread.After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that the recovery tests consistently take much longer. Upon further
inspection, it looks like a similar race condition to the one described in
e5d494d's commit message. With some added debug logs, I see that all of
the callers of MaybeStartWalReceiver() complete before SIGCHLD is
processed, so ServerLoop() waits for a minute before starting the WAL
receiver.The attached patch fixes this by adjusting DetermineSleepTime() to limit
the sleep to at most 100ms when WalReceiverRequested is set, similar to how
the sleep is limited when background workers must be restarted.
Is the problem here that SIGCHLD is processed ...
PG_SETMASK(&UnBlockSig); <--- here?
selres = select(nSockets, &rmask, NULL, NULL, &timeout);
Meanwhile the SIGCHLD handler code says:
* Was it the wal receiver? If exit status is zero (normal) or one
* (FATAL exit), we assume everything is all right just like normal
* backends. (If we need a new wal receiver, we'll start one at the
* next iteration of the postmaster's main loop.)
... which is true, but that won't be reached for a while in this case
if the timeout has already been set to 60s. Your patch makes that
100ms, in that case, a time delay that by now attracts my attention
like a red rag to a bull (I don't know why you didn't make it 0).
I'm not sure, but if I got that right, then I think the whole problem
might automatically go away with CF #4032. The SIGCHLD processing
code will run not when signals are unblocked before select() (that is
gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
and runs the handler code in the regular user context before dropping
through to the rest of the main loop.
On Wed, Jan 11, 2023 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
(I don't know why you didn't make it 0)
(Oh, I see why it had to be non-zero to avoiding burning CPU, ignore that part.)
On Wed, Jan 11, 2023 at 05:20:38PM +1300, Thomas Munro wrote:
Is the problem here that SIGCHLD is processed ...
PG_SETMASK(&UnBlockSig); <--- here?
selres = select(nSockets, &rmask, NULL, NULL, &timeout);
Meanwhile the SIGCHLD handler code says:
* Was it the wal receiver? If exit status is zero (normal) or one
* (FATAL exit), we assume everything is all right just like normal
* backends. (If we need a new wal receiver, we'll start one at the
* next iteration of the postmaster's main loop.)... which is true, but that won't be reached for a while in this case
if the timeout has already been set to 60s. Your patch makes that
100ms, in that case, a time delay that by now attracts my attention
like a red rag to a bull (I don't know why you didn't make it 0).
I think this is right. At the very least, it seems consistent with my
observations.
I'm not sure, but if I got that right, then I think the whole problem
might automatically go away with CF #4032. The SIGCHLD processing
code will run not when signals are unblocked before select() (that is
gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
and runs the handler code in the regular user context before dropping
through to the rest of the main loop.
Yeah, with those patches, the problem goes away. IIUC the key part is that
the postmaster's latch gets set when SIGCHLD is received, so even if
SIGUSR1 and SIGCHLD are processed out of order, WalReceiverPID gets cleared
and we try to restart it shortly thereafter. I find this much easier to
reason about.
I'll go ahead and withdraw this patch from the commitfest. Thanks for
chiming in.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com