Checkpointer sync queue fills up / loops around pg_usleep() are bad

Started by Andres Freund almost 4 years ago · 8 messages
#1 Andres Freund
andres@anarazel.de

Hi,

In two recent investigations in occasional test failures
(019_replslot_limit.pl failures, AIO rebase) the problems are somehow tied to
checkpointer.

I don't yet know whether it's actually causally related to precisely those failures, but
when running e.g. 027_stream_regress.pl, I see phases in which many backends
are looping in RegisterSyncRequest() repeatedly, each time sleeping with
pg_usleep(10000L).

Without adding instrumentation this is completely invisible at any log
level: there are no log messages, no wait events, nothing.

ISTM we should not have any loops around pg_usleep() at all. And shorter
term, we shouldn't have any loops around pg_usleep() that don't emit log
messages / set wait events. Therefore I propose that we "prohibit" such loops
without at least a DEBUG2 elog() or so. It's just too hard to debug otherwise.
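
For illustration, that minimum bar could look something like this (just a
sketch; loop_condition() and WAIT_EVENT_SOMETHING are hypothetical
placeholders, not existing code):

    for (;;)
    {
        if (!loop_condition())  /* hypothetical retry predicate */
            break;
        /* at least make the wait visible in the log and pg_stat_activity */
        elog(DEBUG2, "still waiting, retrying in 10ms");
        pgstat_report_wait_start(WAIT_EVENT_SOMETHING); /* hypothetical */
        pg_usleep(10000L);
        pgstat_report_wait_end();
    }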

The reason for the sync queue filling up in 027_stream_regress.pl is actually
fairly simple:

1) The test runs with shared_buffers = 1MB, leading to a small sync queue of
128 entries.
2) CheckpointWriteDelay() does pg_usleep(100000L), i.e. sleeps 100ms at a time.

ForwardSyncRequest() wakes up the checkpointer using SetLatch() if the sync
queue is more than half full.
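
(For scale: with the default 8kB block size, shared_buffers = 1MB means
NBuffers = 128, and CheckpointerShmemInit() sizes the queue as max_requests =
NBuffers, so the "more than half full" wakeup threshold here is only 64
pending requests.)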

But at least on Linux and FreeBSD that no longer actually interrupts
pg_usleep(), since latches there now use signalfd / kqueue rather than a
signal handler. And on all platforms the signal might arrive just before the
pg_usleep() rather than during it, in which case the sleep is not interrupted
either.
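
To make the two races concrete, the difference is roughly this (a sketch, not
the committed change; the wait event name is the one added by the patch later
in this thread):

    /* Racy: a SetLatch() just before this call does nothing, and on
     * Linux/FreeBSD (signalfd/kqueue) one during it no longer interrupts
     * the sleep either. */
    pg_usleep(100000L);

    /* Latch-aware: returns immediately if the latch was set at any point
     * since the last ResetLatch(), and SetLatch() wakes the sleep itself,
     * closing both races. */
    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
                     100 /* ms */, WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
    ResetLatch(MyLatch);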

If I shorten the sleep in CheckpointWriteDelay() the problem goes away. This
actually reduces the time for a single run of 027_stream_regress.pl on my
workstation noticeably. With the default sleep time it's ~32s; with the shortened time
it's ~27s.

I suspect we need to do something about this concrete problem for 14 and
master, because it's certainly worse than before on Linux / FreeBSD.

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

Greetings,

Andres Freund

#2 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#1)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD. In recent memory, I
think that only some of the extensions published by PostgresPro rely
on the enums in this area.
--
Michael

#3 Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#2)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:

On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD.

+1

In recent memory, I
think that only some of the extensions published by PostgresPro rely
on the enums in this area.

Indeed, I only know of pg_wait_sampling which uses it. Note that it relies on
pgstat_get_wait_event* functions, so it should only return "???" / "unknown
wait event" until you recompile it for a newer minor version, and not report
errors or crash. All other extensions I know of simply use whatever
pg_stat_activity returns, so no impact.

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#3)
1 attachment(s)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

On Sun, Feb 27, 2022 at 10:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:

On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD.

+1

+1

Sleeps like these are also really bad for ProcSignalBarrier, which I
was expecting to be the impetus for fixing any remaining loops like
this.

With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
on my FreeBSD workstation!

It seems a little strange to introduce into a stable branch a new wait
event that will very often appear, but ... it is actually telling the
truth, so there is that.

The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:

-               pg_usleep(10000L);
+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+                         WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?
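
The generic shape of that would be something like this (a sketch of the
condition-variable pattern, using the names from the patch that appears later
in this thread; queue_is_full() is a hypothetical placeholder):

    /* waiter: sleep until the checkpointer makes space in the queue */
    ConditionVariablePrepareToSleep(&requests_not_full_cv);
    while (queue_is_full())
        ConditionVariableSleep(&requests_not_full_cv,
                               WAIT_EVENT_FORWARD_SYNC_REQUEST);
    ConditionVariableCancelSleep();

    /* checkpointer: after draining a previously-full queue */
    ConditionVariableBroadcast(&requests_not_full_cv);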

Attachments:

0001-Wake-up-for-latches-in-CheckpointWriteDelay.patch (text/x-patch)
From 23abdf95dea74b914dc56a46739a63ff86035f51 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH] Wake up for latches in CheckpointWriteDelay().

The checkpointer shouldn't ignore its latch.  Other backends may be
waiting for it to drain the request queue.  Hopefully real systems don't
have a full queue often, but the condition is reached easily
when shared_buffers is very small.

This involves defining a new wait event, which will appear in the
pg_stat_activity view often due to spread checkpoints.

Back-patch only to 14.  Even though the problem exists in earlier
branches too, it's hard to hit there.  In 14 we stopped using signal
handlers for latches on Linux, *BSD and macOS, which were previously
hiding this problem by interrupting the sleep (though not reliably, as
the signal could arrive before the sleep begins; precisely the problem
latches address).

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 4 ++++
 src/backend/postmaster/checkpointer.c   | 5 ++++-
 src/backend/utils/activity/wait_event.c | 3 +++
 src/include/utils/wait_event.h          | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf7625d988..9dc3978f35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2236,6 +2236,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>BaseBackupThrottle</literal></entry>
       <entry>Waiting during base backup when throttling activity.</entry>
      </row>
+     <row>
+      <entry><literal>CheckpointWriteDelay</literal></entry>
+      <entry>Waiting between writes while performing a checkpoint.</entry>
+     </row>
      <row>
       <entry><literal>PgSleep</literal></entry>
       <entry>Waiting due to a call to <function>pg_sleep</function> or
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..9d9aad166e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -726,7 +726,10 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+				  100,
+				  WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
+		ResetLatch(MyLatch);
 	}
 	else if (--absorb_counter <= 0)
 	{
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..0706e922b5 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -485,6 +485,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_BASE_BACKUP_THROTTLE:
 			event_name = "BaseBackupThrottle";
 			break;
+		case WAIT_EVENT_CHECKPOINT_WRITE_DELAY:
+			event_name = "CheckpointWriteDelay";
+			break;
 		case WAIT_EVENT_PG_SLEEP:
 			event_name = "PgSleep";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..d0345c6b49 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -141,6 +141,7 @@ typedef enum
 typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+	WAIT_EVENT_CHECKPOINT_WRITE_DELAY,
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-- 
2.35.1

#5 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#4)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

Hi,

On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:

With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
on my FreeBSD workstation!

That's impressive - wouldn't have guessed it'd make that much of a difference. I assume that running the tests on FreeBSD for an older pg with a similar s_b & max_wal_size doesn't benefit as much? I wonder how much Windows will improve.

It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...

The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:

-               pg_usleep(10000L);
+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+                         WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches.

I do think it's worth giving that sleep a proper wait event though, even in the back branches.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#5)
3 attachment(s)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

On Mon, Feb 28, 2022 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:

On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:

It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

Yeah.

I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...

I might look into this.

The sleep/poll loop in RegisterSyncRequest() may also have another
problem. The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits. I guess we
could do:

-               pg_usleep(10000L);
+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+                         WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no? Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches.

0003 is a first attempt at that, for master only (on top of 0002 which
is the minimal fix). This shaves another second off
027_stream_regress.pl on my workstation. The main thing I realised is
that I needed to hold interrupts while waiting, which seems like it
should go away with 'tombstone' files as discussed in other threads.
That's not a new problem in this patch, it just looks more offensive
to the eye when you spell it out, instead of hiding it with an
unreported sleep/poll loop...

I do think it's worth giving that sleep a proper wait event though, even in the back branches.

I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.

Attachments:

v2-0001-Wake-up-for-latches-in-CheckpointWriteDelay.patch (text/x-patch)
From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

The checkpointer shouldn't ignore its latch.  Other backends may be
waiting for it to drain the request queue.  Hopefully real systems don't
have a full queue often, but the condition is reached easily when
shared_buffers is very small.

This involves defining a new wait event, which will appear in the
pg_stat_activity view often due to spread checkpoints.

Back-patch only to 14.  Even though the problem exists in earlier
branches too, it's hard to hit there.  In 14 we stopped using signal
handlers for latches on Linux, *BSD and macOS, which were previously
hiding this problem by interrupting the sleep (though not reliably, as
the signal could arrive before the sleep begins; precisely the problem
latches address).

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 4 ++++
 src/backend/postmaster/checkpointer.c   | 8 +++++++-
 src/backend/utils/activity/wait_event.c | 3 +++
 src/include/utils/wait_event.h          | 1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..8620aaddc7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2235,6 +2235,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>BaseBackupThrottle</literal></entry>
       <entry>Waiting during base backup when throttling activity.</entry>
      </row>
+     <row>
+      <entry><literal>CheckpointWriteDelay</literal></entry>
+      <entry>Waiting between writes while performing a checkpoint.</entry>
+     </row>
      <row>
       <entry><literal>PgSleep</literal></entry>
       <entry>Waiting due to a call to <function>pg_sleep</function> or
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..a59c3cf020 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -484,6 +484,9 @@ CheckpointerMain(void)
 			}
 
 			ckpt_active = false;
+
+			/* We may have received an interrupt during the checkpoint. */
+			HandleCheckpointerInterrupts();
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -726,7 +729,10 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+				  100,
+				  WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
+		ResetLatch(MyLatch);
 	}
 	else if (--absorb_counter <= 0)
 	{
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..0706e922b5 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -485,6 +485,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_BASE_BACKUP_THROTTLE:
 			event_name = "BaseBackupThrottle";
 			break;
+		case WAIT_EVENT_CHECKPOINT_WRITE_DELAY:
+			event_name = "CheckpointWriteDelay";
+			break;
 		case WAIT_EVENT_PG_SLEEP:
 			event_name = "PgSleep";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..d0345c6b49 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -141,6 +141,7 @@ typedef enum
 typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+	WAIT_EVENT_CHECKPOINT_WRITE_DELAY,
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-- 
2.30.2

v2-0002-Fix-waiting-in-RegisterSyncRequest.patch (text/x-patch)
From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

If we run out of space in the checkpointer sync request queue (which is
hopefully rare on real systems, but common with very small buffer pool),
we wait for it to drain.  While waiting, we should report that as a wait
event so that users know what is going on, and also handle postmaster
death, since otherwise the loop might never terminate if the
checkpointer has exited.

Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 5 +++++
 src/backend/storage/sync/sync.c         | 4 +++-
 src/backend/utils/activity/wait_event.c | 3 +++
 src/include/utils/wait_event.h          | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8620aaddc7..71559442f0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2254,6 +2254,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting during recovery when WAL data is not available from any
        source (<filename>pg_wal</filename>, archive or stream).</entry>
      </row>
+     <row>
+      <entry><literal>RegisterSyncRequest</literal></entry>
+      <entry>Waiting while sending synchronization requests to the
+       checkpointer, because the request queue is full.</entry>
+     </row>
      <row>
       <entry><literal>VacuumDelay</literal></entry>
       <entry>Waiting in a cost-based vacuum delay point.</entry>
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..0c4d9ce687 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -31,6 +31,7 @@
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/latch.h"
 #include "storage/md.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -606,7 +607,8 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 		if (ret || (!ret && !retryOnError))
 			break;
 
-		pg_usleep(10000L);
+		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+				  WAIT_EVENT_REGISTER_SYNC_REQUEST);
 	}
 
 	return ret;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 0706e922b5..ff46a0e3c7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -497,6 +497,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
 			event_name = "RecoveryRetrieveRetryInterval";
 			break;
+		case WAIT_EVENT_REGISTER_SYNC_REQUEST:
+			event_name = "RegisterSyncRequest";
+			break;
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index d0345c6b49..1c39ce031a 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -145,6 +145,7 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
+	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_VACUUM_DELAY,
 	WAIT_EVENT_VACUUM_TRUNCATE
 } WaitEventTimeout;
-- 
2.30.2

v2-0003-Use-condition-variable-to-wait-when-sync-request-.patch (text/x-patch)
From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
 is full.

Previously, in the (hopefully) rare case that we need to wait for the
checkpointer to create space in the sync request queue, we'd enter a
sleep/retry loop.  Instead, create a condition variable so the
checkpointer can wake us up whenever there is a transition from 'full'
to 'not full'.

Discussion:  https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 10 +++---
 src/backend/postmaster/checkpointer.c   | 38 +++++++++++++++++++---
 src/backend/storage/sync/sync.c         | 42 ++++++++++---------------
 src/backend/utils/activity/wait_event.c |  6 ++--
 src/include/postmaster/bgwriter.h       |  4 ++-
 src/include/utils/wait_event.h          |  2 +-
 6 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 71559442f0..5e11548075 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1626,6 +1626,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for activity from a child process while
        executing a <literal>Gather</literal> plan node.</entry>
      </row>
+     <row>
+      <entry><literal>ForwardSyncRequest</literal></entry>
+      <entry>Waiting while sending synchronization requests to the
+       checkpointer, because the request queue is full.</entry>
+     </row>
      <row>
       <entry><literal>HashBatchAllocate</literal></entry>
       <entry>Waiting for an elected Parallel Hash participant to allocate a hash
@@ -2254,11 +2259,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting during recovery when WAL data is not available from any
        source (<filename>pg_wal</filename>, archive or stream).</entry>
      </row>
-     <row>
-      <entry><literal>RegisterSyncRequest</literal></entry>
-      <entry>Waiting while sending synchronization requests to the
-       checkpointer, because the request queue is full.</entry>
-     </row>
      <row>
       <entry><literal>VacuumDelay</literal></entry>
       <entry>Waiting in a cost-based vacuum delay point.</entry>
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a59c3cf020..16cf472f90 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -128,6 +128,7 @@ typedef struct
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
+	ConditionVariable requests_not_full_cv;	/* signaled when space available */
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
 	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -903,6 +904,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->requests_not_full_cv);
 	}
 }
 
@@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
  * to perform its own fsync, which is far more expensive in practice.  It
  * is theoretically possible a backend fsync might still be necessary, if
  * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
+ * let the backend know by returning false, or if 'wait' is true, then we
+ * wait for space to become available.
  */
 bool
-ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
 {
 	CheckpointerRequest *request;
 	bool		too_full;
@@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	 * backend will have to perform its own fsync request.  But before forcing
 	 * that to happen, we can try to compact the request queue.
 	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
+		!CompactCheckpointerRequestQueue() &&
+		!wait)
 	{
 		/*
 		 * Count the subset of writes where backends have to do their own
@@ -1115,6 +1118,24 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		return false;
 	}
 
+	/*
+	 * If we still don't have enough space and the caller asked us to wait,
+	 * wait for the checkpointer to advertise that there is space.
+	 */
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
+	{
+		ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
+		while (CheckpointerShmem->num_requests >=
+			   CheckpointerShmem->max_requests)
+		{
+			LWLockRelease(CheckpointerCommLock);
+			ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
+								   WAIT_EVENT_FORWARD_SYNC_REQUEST);
+			LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+		}
+		ConditionVariableCancelSleep();
+	}
+
 	/* OK, insert request */
 	request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
 	request->ftag = *ftag;
@@ -1261,6 +1282,7 @@ AbsorbSyncRequests(void)
 	CheckpointerRequest *requests = NULL;
 	CheckpointerRequest *request;
 	int			n;
+	bool		was_full;
 
 	if (!AmCheckpointerProcess())
 		return;
@@ -1293,6 +1315,8 @@ AbsorbSyncRequests(void)
 		memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
 	}
 
+	was_full = n >= CheckpointerShmem->max_requests;
+
 	START_CRIT_SECTION();
 
 	CheckpointerShmem->num_requests = 0;
@@ -1304,6 +1328,10 @@ AbsorbSyncRequests(void)
 
 	END_CRIT_SECTION();
 
+	/* Wake anyone waiting for space in ForwardSyncRequest(). */
+	if (was_full)
+		ConditionVariableBroadcast(&CheckpointerShmem->requests_not_full_cv);
+
 	if (requests)
 		pfree(requests);
 }
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 0c4d9ce687..d0fca02ee6 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -585,31 +585,23 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 		return true;
 	}
 
-	for (;;)
-	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue a message in
-		 * retryOnError mode, we have to sleep and try again ... ugly, but
-		 * hopefully won't happen often.
-		 *
-		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
-		 * error in the case of SYNC_UNLINK_REQUEST would leave the
-		 * no-longer-used file still present on disk, which would be bad, so
-		 * I'm inclined to assume that the checkpointer will always empty the
-		 * queue soon.
-		 */
-		ret = ForwardSyncRequest(ftag, type);
-
-		/*
-		 * If we are successful in queueing the request, or we failed and were
-		 * instructed not to retry on error, break.
-		 */
-		if (ret || (!ret && !retryOnError))
-			break;
-
-		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
-				  WAIT_EVENT_REGISTER_SYNC_REQUEST);
-	}
+	/*
+	 * Notify the checkpointer about it.  If we fail to queue a message in
+	 * retryOnError mode, we wait until space is available ... ugly, but
+	 * hopefully won't happen often.
+	 *
+	 * Don't allow interrupts while waiting.  Escaping with an error in the
+	 * case of SYNC_UNLINK_REQUEST would leave the no-longer-used file still
+	 * present on disk, which would be bad, so I'm inclined to assume that the
+	 * checkpointer will always empty the queue soon.
+	 *
+	 * XXX This concern would go away along with SYNC_UNLINK_REQUEST if we
+	 * figure out how to get rid of 'tombstone files'.  That would be a good
+	 * idea because holding interrupts is bad for ProcSignalBarrier.
+	 */
+	HOLD_INTERRUPTS();
+	ret = ForwardSyncRequest(ftag, type, retryOnError /* wait */);
+	RESUME_INTERRUPTS();
 
 	return ret;
 }
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e3c7..15c35691b6 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -346,6 +346,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_EXECUTE_GATHER:
 			event_name = "ExecuteGather";
 			break;
+		case WAIT_EVENT_FORWARD_SYNC_REQUEST:
+			event_name = "ForwardSyncRequest";
+			break;
 		case WAIT_EVENT_HASH_BATCH_ALLOCATE:
 			event_name = "HashBatchAllocate";
 			break;
@@ -497,9 +500,6 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
 			event_name = "RecoveryRetrieveRetryInterval";
 			break;
-		case WAIT_EVENT_REGISTER_SYNC_REQUEST:
-			event_name = "RegisterSyncRequest";
-			break;
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2882efd67b..fa82007fc0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -33,7 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
+extern bool ForwardSyncRequest(const FileTag *ftag,
+							   SyncRequestType type,
+							   bool wait);
 
 extern void AbsorbSyncRequests(void);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce031a..3b07c3b878 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -91,6 +91,7 @@ typedef enum
 	WAIT_EVENT_CHECKPOINT_DONE,
 	WAIT_EVENT_CHECKPOINT_START,
 	WAIT_EVENT_EXECUTE_GATHER,
+	WAIT_EVENT_FORWARD_SYNC_REQUEST,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
 	WAIT_EVENT_HASH_BATCH_LOAD,
@@ -145,7 +146,6 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_VACUUM_DELAY,
 	WAIT_EVENT_VACUUM_TRUNCATE
 } WaitEventTimeout;
-- 
2.30.2

#7 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#6)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

Hi,

On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:

I do think it's worth giving that sleep a proper wait event though, even in the back branches.

I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.

No strong opinion on how far back to backpatch. It'd be nice to have a proper
wait event everywhere, but especially < 12 the code looks different enough to
be some effort.

From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)

From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.

From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
is full.

Previously, in the (hopefully) rare case that we need to wait for the
checkpointer to create space in the sync request queue, we'd enter a
sleep/retry loop. Instead, create a condition variable so the
checkpointer can wake us up whenever there is a transition from 'full'
to 'not full'.

@@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
* to perform its own fsync, which is far more expensive in practice.  It
* is theoretically possible a backend fsync might still be necessary, if
* the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
+ * let the backend know by returning false, or if 'wait' is true, then we
+ * wait for space to become available.
*/
bool
-ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
{
CheckpointerRequest *request;
bool		too_full;
@@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* backend will have to perform its own fsync request.  But before forcing
* that to happen, we can try to compact the request queue.
*/
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
+		!CompactCheckpointerRequestQueue() &&
+		!wait)

Bit confused about the addition of the wait parameter / removal of the
CheckpointerShmem->checkpointer_pid check. What's that about?

+	/*
+	 * If we still don't have enough space and the caller asked us to wait,
+	 * wait for the checkpointer to advertise that there is space.
+	 */
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
+	{
+		ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
+		while (CheckpointerShmem->num_requests >=
+			   CheckpointerShmem->max_requests)
+		{
+			LWLockRelease(CheckpointerCommLock);
+			ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
+								   WAIT_EVENT_FORWARD_SYNC_REQUEST);
+			LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+		}
+		ConditionVariableCancelSleep();
+	}

Could there be a problem with a lot of backends trying to acquire
CheckpointerCommLock in exclusive mode? I'm inclined to think it's rare enough
to not worry.

Greetings,

Andres Freund

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#7)
Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

On Wed, Mar 2, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:

From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)

Thanks for the review. Pushed to master and 14, with the wait event
moved to the end of the enum for the back-patch.

From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.

Pushed as far back as 12. It could be done for 10 & 11, but indeed
the code starts getting quite different back there, and since there
are no field reports, I think that's OK for now.

A simple repro, for the record: run installcheck with
shared_buffers=256kB, and then partway through, kill -STOP
$checkpointer to simulate being stalled on IO for a while. Backends
will soon start waiting for the checkpointer to drain the queue while
dropping relations. This state was invisible to pg_stat_activity, and
hangs forever if you kill the postmaster and CONT the checkpointer.

From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
is full.

[review]

I'll come back to 0003 (condition variable-based improvement) a bit later.