low wal_retrieve_retry_interval causes missed signals on Windows

Started by Nathan Bossartabout 3 years ago5 messages
#1Nathan Bossart
nathandbossart@gmail.com
1 attachment(s)

I discussed this elsewhere [0]/messages/by-id/20221231235019.GA1223171@nathanxps13, but I thought it deserved its own thread.

After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that some of the archiving tests began consistently failing on Windows. I
believe the problem is that WaitForWALToBecomeAvailable() depends on the
call to WaitLatch() for wal_retrieve_retry_interval to ensure that signals
are dispatched (i.e., pgwin32_dispatch_queued_signals()). With a low retry
interval, WaitForWALToBecomeAvailable() might skip the call to WaitLatch(),
and the signals are never processed.

The attached patch fixes this by always calling WaitLatch(), even if
wal_retrieve_retry_interval milliseconds have already elapsed and the
timeout is 0.

[0]: /messages/by-id/20221231235019.GA1223171@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-ensure-signals-are-dispatched-in-startup-process-.patchtext/x-diff; charset=us-asciiDownload
From d08f01156ce1ce1290bad440da97852c2aa6c24b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 31 Dec 2022 15:16:54 -0800
Subject: [PATCH v1 1/1] ensure signals are dispatched in startup process on
 Windows

---
 src/backend/access/transam/xlogrecovery.c | 36 ++++++++++++++---------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index bc3c3eb3e7..b1a4c4ab6d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3466,6 +3466,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			long		wait_time;
+
 			/*
 			 * Don't allow any retry loops to occur during nonblocking
 			 * readahead.  Let the caller process everything that has been
@@ -3556,33 +3558,39 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and retry from the archive, but if it hasn't been long
 					 * since last attempt, sleep wal_retrieve_retry_interval
 					 * milliseconds to avoid busy-waiting.
+					 *
+					 * NB: Even if it's already been wal_retrieve_rety_interval
+					 * milliseconds since the last attempt, we still call
+					 * WaitLatch() with the timeout set to 0 to make sure any
+					 * queued signals are dispatched on Windows builds.
 					 */
 					now = GetCurrentTimestamp();
 					if (!TimestampDifferenceExceeds(last_fail_time, now,
 													wal_retrieve_retry_interval))
 					{
-						long		wait_time;
-
 						wait_time = wal_retrieve_retry_interval -
 							TimestampDifferenceMilliseconds(last_fail_time, now);
 
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
+					}
+					else
+						wait_time = 0;
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
+					/* Do background tasks that might benefit us later. */
+					KnownAssignedTransactionIdsIdleMaintenance();
 
-						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
-										 WL_LATCH_SET | WL_TIMEOUT |
-										 WL_EXIT_ON_PM_DEATH,
-										 wait_time,
-										 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
-						ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
-						now = GetCurrentTimestamp();
+					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
+									 WL_LATCH_SET | WL_TIMEOUT |
+									 WL_EXIT_ON_PM_DEATH,
+									 wait_time,
+									 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
+					ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
+					now = GetCurrentTimestamp();
+
+					/* Handle interrupt signals of startup process */
+					HandleStartupProcInterrupts();
 
-						/* Handle interrupt signals of startup process */
-						HandleStartupProcInterrupts();
-					}
 					last_fail_time = now;
 					currentSource = XLOG_FROM_ARCHIVE;
 					break;
-- 
2.25.1

#2Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#1)
Re: low wal_retrieve_retry_interval causes missed signals on Windows

Hi,

On 2023-01-10 22:11:16 -0800, Nathan Bossart wrote:

The attached patch fixes this by always calling WaitLatch(), even if
wal_retrieve_retry_interval milliseconds have already elapsed and the
timeout is 0.

It doesn't seem right to call WaitLatch() just for that purpose - nor
necessarily a complete fix.

Given that we check for interrupts in other parts of recovery with
HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the
actual bug that HandleStartupProcInterrupt() doesn't contain the same black
magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff:

#ifndef WIN32
...
#else
#define INTERRUPTS_PENDING_CONDITION() \
(unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \
unlikely(InterruptPending))
#endif

/* Service interrupt, if one is pending and it's safe to service it now */
#define CHECK_FOR_INTERRUPTS() \
do { \
if (INTERRUPTS_PENDING_CONDITION()) \
ProcessInterrupts(); \
} while(0)

Looks like we have that bug in quite a few places... Some are "protected" by
unconditional WaitLatch() calls, but at least pgarch.c, checkpointer.c via
CheckpointWriteDelay() seem borked.

Greetings,

Andres Freund

#3Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: low wal_retrieve_retry_interval causes missed signals on Windows

On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote:

Given that we check for interrupts in other parts of recovery with
HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the
actual bug that HandleStartupProcInterrupt() doesn't contain the same black
magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff:

Yeah, this seems like a more comprehensive fix. I've attached a patch that
adds this Windows signaling stuff to the HandleXXXInterrupts() functions in
the files you listed. Is this roughly what you had in mind? If so, I'll
look around for anywhere else it is needed.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

dispatch-signals-on-windows-v2.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index de0bbbfa79..e57be4f72b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -543,6 +543,11 @@ CheckpointerMain(void)
 static void
 HandleCheckpointerInterrupts(void)
 {
+#ifdef WIN32
+	if (UNBLOCKED_SIGNAL_QUEUE())
+		pgwin32_dispatch_queued_signals();
+#endif
+
 	if (ProcSignalBarrierPending)
 		ProcessProcSignalBarrier();
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 8ecdb9ca23..68184894fe 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -777,6 +777,11 @@ pgarch_die(int code, Datum arg)
 static void
 HandlePgArchInterrupts(void)
 {
+#ifdef WIN32
+	if (UNBLOCKED_SIGNAL_QUEUE())
+		pgwin32_dispatch_queued_signals();
+#endif
+
 	if (ProcSignalBarrierPending)
 		ProcessProcSignalBarrier();
 
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 8786186898..0e8cb3a174 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -171,6 +171,11 @@ HandleStartupProcInterrupts(void)
 	static uint32 postmaster_poll_count = 0;
 #endif
 
+#ifdef WIN32
+	if (UNBLOCKED_SIGNAL_QUEUE())
+		pgwin32_dispatch_queued_signals();
+#endif
+
 	/*
 	 * Process any requests or signals received recently.
 	 */
#4Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#3)
Re: low wal_retrieve_retry_interval causes missed signals on Windows

Hi,

On 2023-01-11 15:26:45 -0800, Nathan Bossart wrote:

On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote:

Given that we check for interrupts in other parts of recovery with
HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the
actual bug that HandleStartupProcInterrupt() doesn't contain the same black
magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff:

Yeah, this seems like a more comprehensive fix. I've attached a patch that
adds this Windows signaling stuff to the HandleXXXInterrupts() functions in
the files you listed. Is this roughly what you had in mind? If so, I'll
look around for anywhere else it is needed.

Yes, that's what I roughly was thinking of. Although seeing the diff, I think
it might be worth introducing a helper function that'd containing at least
pgwin32_dispatch_queued_signals() and ProcessProcSignalBarrier(). It's a bit
complicated by ProcessProcSignalBarrier() only being applicable to shared
memory connected processes - excluding e.g. pgarch.

Greetings,

Andres Freund

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#4)
Re: low wal_retrieve_retry_interval causes missed signals on Windows

On Wed, Jan 11, 2023 at 04:40:14PM -0800, Andres Freund wrote:

On 2023-01-11 15:26:45 -0800, Nathan Bossart wrote:

On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote:

Given that we check for interrupts in other parts of recovery with
HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the
actual bug that HandleStartupProcInterrupt() doesn't contain the same black
magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff:

Yeah, this seems like a more comprehensive fix. I've attached a patch that
adds this Windows signaling stuff to the HandleXXXInterrupts() functions in
the files you listed. Is this roughly what you had in mind? If so, I'll
look around for anywhere else it is needed.

Yes, that's what I roughly was thinking of. Although seeing the diff, I think
it might be worth introducing a helper function that'd containing at least
pgwin32_dispatch_queued_signals() and ProcessProcSignalBarrier(). It's a bit
complicated by ProcessProcSignalBarrier() only being applicable to shared
memory connected processes - excluding e.g. pgarch.

As of d75288f, the archiver should be connected to shared memory, so we
might be in luck. I guess we'd need to watch out for this if we want to
back-patch it beyond v14. I'll work on a patch...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com