connection establishment versus parallel workers

Started by Nathan Bossart about 1 year ago, 11 messages
#1 Nathan Bossart
nathandbossart@gmail.com
1 attachment(s)

My team recently received a report about connection establishment times
increasing substantially from v16 onwards. Upon further investigation,
this seems to have something to do with commit 7389aad (which moved a lot
of postmaster code out of signal handlers) in conjunction with workloads
that generate many parallel workers. I've attached a set of reproduction
steps. The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
r5.24xlarge) when max_parallel_workers/max_worker_processes are set very high
(>= 48).

Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
parallel worker processing much more responsive to the point of contending
with incoming connections, and that before this change, the kernel balanced
the execution of the signal handlers and ServerLoop() to prevent this. I
don't have a concrete proposal yet, but I thought it was still worth
starting a discussion. TBH I'm not sure we really need to do anything
since this arguably comes down to a trade-off between connection and worker
responsiveness.

--
nathan

Attachments:

repro.txt (text/plain; charset=us-ascii)
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#1)
Re: connection establishment versus parallel workers

On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards. Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers. I've attached a set of reproduction
> steps. The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes are set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this. I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion. TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

* Check if the latch is set already. If so, leave the loop
* immediately, avoid blocking again. We don't attempt to report any
* other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help. Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
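
To sketch the idea in pseudo-C (simplified, hypothetical code, not the
actual patch, which appears downthread): report the already-set latch as
before, but instead of returning, poll the kernel once more with a zero
timeout:

	if (set->latch && set->latch->is_set)
	{
		/* Report the latch event as usual. */
		occurred_events->events = WL_LATCH_SET;
		returned_events++;
		/* But don't return yet: poll once more without sleeping. */
		cur_timeout = 0;
	}
	rc = WaitEventSetWaitBlock(set, cur_timeout,
							   occurred_events + returned_events,
							   nevents - returned_events);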

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
Re: connection establishment versus parallel workers

On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> ... instead of
> reporting only the lowest priority one)

s/priority/position/

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
1 attachment(s)
Re: connection establishment versus parallel workers

On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> > Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> > parallel worker processing much more responsive to the point of contending
> > with incoming connections, and that before this change, the kernel balanced
> > the execution of the signal handlers and ServerLoop() to prevent this. I
> > don't have a concrete proposal yet, but I thought it was still worth
> > starting a discussion. TBH I'm not sure we really need to do anything
> > since this arguably comes down to a trade-off between connection and worker
> > responsiveness.

> One factor is:

> * Check if the latch is set already. If so, leave the loop
> * immediately, avoid blocking again. We don't attempt to report any
> * other events that might also be satisfied.

Here's an experimental patch to try changing that policy. It improves
the connection times on my small computer with your test, but I doubt
I'm seeing the real issue. But in theory, assuming a backlog of
connections and workers to start, I think each server loop should be
able to accept and fork one client backend, and fork up to 100
(MAX_BGWORKERS_TO_LAUNCH) background workers.
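
For context, each pass of ServerLoop() dispatches events roughly like
this (heavily simplified from postmaster.c, with details elided):

	nevents = WaitEventSetWait(pm_wait_set, timeout, events,
							   lengthof(events), 0);
	for (int i = 0; i < nevents; i++)
	{
		if (events[i].events & WL_LATCH_SET)
			ResetLatch(MyLatch);
		if (pending_pm_pmsignal)
			process_pm_pmsignal();	/* may start many background workers */
		if (events[i].events & WL_SOCKET_ACCEPT)
		{
			/* accept and fork one client backend per ready socket */
		}
	}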

Attachments:

v1-0001-Latches-shouldn-t-starve-socket-event-delivery.patch (text/x-patch; charset=US-ASCII)
From 0444997d07342d733320743213cec736b0d008d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 12 Dec 2024 23:46:50 +1300
Subject: [PATCH v1] Latches shouldn't starve socket event delivery.

If a WaitEventSetWait() caller wants multiple events, an already-set
latch shouldn't prevent other events from being reported at the same
time.  It should certainly prevent sleeping, but that just requires
using a zero timeout when polling the kernel.

The main caller of WaitEventSetWait() with nevents > 1 is the
postmaster.  Even if its latch is frequently set, we don't want socket
accept processing to be starved.

XXX experimental

Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 35 ++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f19ce5e7aff..67ace0964d7 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1452,9 +1452,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		int			rc;
 
 		/*
-		 * Check if the latch is set already. If so, leave the loop
-		 * immediately, avoid blocking again. We don't attempt to report any
-		 * other events that might also be satisfied.
+		 * Check if the latch is set already first.  If so, we either exit
+		 * immediately or ask the kernel for any more events available right
+		 * now without waiting, depending on how many events the caller wants.
 		 *
 		 * If someone sets the latch between this and the
 		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
@@ -1499,7 +1499,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			/* could have been set above */
 			set->latch->maybe_sleeping = false;
 
-			break;
+			if (returned_events == nevents)
+				break;	/* output buffer full already */
+
+			/*
+			 * Even though we already have an event, we'll poll just once with
+			 * zero timeout to see what non-latch events we can fit into the
+			 * output buffer at the same time.
+			 */
+			cur_timeout = 0;
+			timeout = 0;
 		}
 
 		/*
@@ -1508,18 +1517,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * to retry, everything >= 1 is the number of returned events.
 		 */
 		rc = WaitEventSetWaitBlock(set, cur_timeout,
-								   occurred_events, nevents);
+								   occurred_events, nevents - returned_events);
 
-		if (set->latch)
-		{
-			Assert(set->latch->maybe_sleeping);
+		if (set->latch &&
+			set->latch->maybe_sleeping)
 			set->latch->maybe_sleeping = false;
-		}
 
 		if (rc == -1)
 			break;				/* timeout occurred */
 		else
-			returned_events = rc;
+			returned_events += rc;
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (returned_events == 0 && timeout >= 0)
@@ -1607,7 +1614,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* Drain the signalfd. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1766,7 +1773,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		if (cur_event->events == WL_LATCH_SET &&
 			cur_kqueue_event->filter == EVFILT_SIGNAL)
 		{
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1891,7 +1898,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -2107,7 +2114,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			if (!ResetEvent(set->handles[cur_event->pos + 1]))
 				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
-- 
2.39.5

#5 Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Munro (#4)
Re: connection establishment versus parallel workers

On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:

> Here's an experimental patch to try changing that policy. It improves
> the connection times on my small computer with your test, but I doubt
> I'm seeing the real issue. But in theory, assuming a backlog of
> connections and workers to start, I think each server loop should be
> able to accept and fork one client backend, and fork up to 100
> (MAX_BGWORKERS_TO_LAUNCH) background workers.

Thanks for the quick response! I'm taking a look at the patch...

--
nathan

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#5)
2 attachment(s)
Re: connection establishment versus parallel workers

On Fri, Dec 13, 2024 at 11:00 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

> On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:

> > Here's an experimental patch to try changing that policy. It improves
> > the connection times on my small computer with your test, but I doubt
> > I'm seeing the real issue. But in theory, assuming a backlog of
> > connections and workers to start, I think each server loop should be
> > able to accept and fork one client backend, and fork up to 100
> > (MAX_BGWORKERS_TO_LAUNCH) background workers.

> Thanks for the quick response! I'm taking a look at the patch...

After sleeping on the let-a-hundred-workers-fork logic, I dug out
aa1351f1eec4 (2017) and read a couple of interesting things:

One could also question the wisdom of using an O(N^2) processing
method if the system is intended to support so many bgworkers.

So that's about the linear search for workers to queue up for
starting, in BackgroundWorkerStateChange(). Hmm. I suppose we should
only really process PM signals when we see WL_LATCH_SET, or I think
we'd run BackgroundWorkerStateChange() twice as often as before, when
the 0001 patch feeds you WL_LATCH_SET and WL_SOCKET_ACCEPT at the same
time, under your backlog-of-all-kinds-of-events workload. Maybe it
doesn't really matter much, but it'd be an unintended behavioural
change, so here is a patch. You could call it a revert of 1/4 of
239b1753. I wonder if that has any problematic side effects for other
PM signal kinds, but I'm looking and I can't see them. And:

There is talk of rewriting the postmaster to use a WaitEventSet to
avoid the signal-response-delay problem, but I'd argue that this change
should be kept even after that happens (if it ever does).

It did happen, and we did keep that. Someone might want to research
that policy, which gives massive preference to workers over clients
(the 100 is more of a cap, in practice it means "start all pending
workers", a bit like how child exit means "reap all dead children",
while we only accept one socket per server socket per server loop),
but no one has complained in half a decade, until, if I intuited
correctly, this report of what I assume is the unbounded socket event
starvation state. On the other hand, you only said that v16 was worse
than v15, not that v15 was actually good! The workload could easily
have been suffering from undiagnosed and unwanted massive bias on v15
already. And starting a lot of workers might be an even more serious
issue if your system is very slow at forking, or if your system
doesn't even have fork() and if PostgreSQL doesn't have threads.

But like 04a09ee9, for me this is not really about a resource
allocation policy that we could debate and perhaps adjust separately.
The main question is whether complete starvation (DOS) is possible, or
whether the postmaster should always try to service all of its
conceptual work queues within bounded delays, no matter what is
happening on the others. I think it should.

0001 patch is unchanged, 0002 patch sketches out a response to the
observation a couple of paragraphs above.

Attachments:

v2-0001-Latches-shouldn-t-starve-socket-event-delivery.patch (text/x-patch; charset=US-ASCII)
From 1d87696494834087c3391ae172088a8f7f5b5799 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 12 Dec 2024 23:46:50 +1300
Subject: [PATCH v2 1/2] Latches shouldn't starve socket event delivery.

If a WaitEventSetWait() caller wants multiple events, an already-set
latch shouldn't prevent other events from being reported at the same
time.  It should certainly prevent sleeping, but that just requires
using a zero timeout when polling the kernel.

The main caller of WaitEventSetWait() with nevents > 1 is the
postmaster.  Even if its latch is frequently set, we don't want socket
accept processing to be starved.

XXX experimental

Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 35 ++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f19ce5e7aff..67ace0964d7 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1452,9 +1452,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		int			rc;
 
 		/*
-		 * Check if the latch is set already. If so, leave the loop
-		 * immediately, avoid blocking again. We don't attempt to report any
-		 * other events that might also be satisfied.
+		 * Check if the latch is set already first.  If so, we either exit
+		 * immediately or ask the kernel for any more events available right
+		 * now without waiting, depending on how many events the caller wants.
 		 *
 		 * If someone sets the latch between this and the
 		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
@@ -1499,7 +1499,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			/* could have been set above */
 			set->latch->maybe_sleeping = false;
 
-			break;
+			if (returned_events == nevents)
+				break;	/* output buffer full already */
+
+			/*
+			 * Even though we already have an event, we'll poll just once with
+			 * zero timeout to see what non-latch events we can fit into the
+			 * output buffer at the same time.
+			 */
+			cur_timeout = 0;
+			timeout = 0;
 		}
 
 		/*
@@ -1508,18 +1517,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * to retry, everything >= 1 is the number of returned events.
 		 */
 		rc = WaitEventSetWaitBlock(set, cur_timeout,
-								   occurred_events, nevents);
+								   occurred_events, nevents - returned_events);
 
-		if (set->latch)
-		{
-			Assert(set->latch->maybe_sleeping);
+		if (set->latch &&
+			set->latch->maybe_sleeping)
 			set->latch->maybe_sleeping = false;
-		}
 
 		if (rc == -1)
 			break;				/* timeout occurred */
 		else
-			returned_events = rc;
+			returned_events += rc;
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (returned_events == 0 && timeout >= 0)
@@ -1607,7 +1614,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* Drain the signalfd. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1766,7 +1773,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		if (cur_event->events == WL_LATCH_SET &&
 			cur_kqueue_event->filter == EVFILT_SIGNAL)
 		{
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1891,7 +1898,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -2107,7 +2114,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			if (!ResetEvent(set->handles[cur_event->pos + 1]))
 				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
-- 
2.47.0

v2-0002-Adjust-pending_pm_pmsignal-processing-priority.patch (text/x-patch; charset=US-ASCII)
From bf0e1069cb6898f24a5607bdf1d0291f1f351be1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Dec 2024 01:11:01 +1300
Subject: [PATCH v2 2/2] Adjust pending_pm_pmsignal processing priority.

Commit 239b175 gave high priority to shutdown requests, child exit
notifications, reload requests and PM signals received from backends.

Undo that last case.  While the others have good reasons for
queue-jumping (supremacy of administrative commands, ordering
requirement of reload-connect sequences, and desire to release child
process resources ASAP), PM signals do not.  Let's go back to processing
PM signals only when a WL_LATCH_SET event is received, and thus only
once per server loop.

It didn't make much difference before, because we'd very likely run the
inner loop that scan events[] just once, because only WL_LATCH_SET was
reported in the common case.  Now that we've adjusted WaitEventSetWait()
to return WL_LATCH_SET and potentially WL_SOCKET_ACCEPT at the same time
when the latch was already set, let's reconsider that.  We don't want to
run BackgroundWorkerStateChange() and similar PM-signal-processing code
twice as often as before in high load scenarios.

XXX experimental

Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w%40mail.gmail.com
---
 src/backend/postmaster/postmaster.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f0f9c66487c..11e7a938f9f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1657,7 +1657,8 @@ ServerLoop(void)
 			 * didn't see WL_LATCH_SET.  This gives high priority to shutdown
 			 * and reload requests where the latch happens to appear later in
 			 * events[] or will be reported by a later call to
-			 * WaitEventSetWait().
+			 * WaitEventSetWait().  We also want to release child resources as
+			 * soon as possible, before accepting sockets.
 			 */
 			if (pending_pm_shutdown_request)
 				process_pm_shutdown_request();
@@ -1665,8 +1666,17 @@ ServerLoop(void)
 				process_pm_reload_request();
 			if (pending_pm_child_exit)
 				process_pm_child_exit();
-			if (pending_pm_pmsignal)
-				process_pm_pmsignal();
+
+			/*
+			 * When receiving high frequency PM signals, we only want to
+			 * process them once per server loop, not once per event of any
+			 * kind.  They are neither very high priority, nor causally linked
+			 * to socket events that might arrive out of order, so we defer
+			 * processing until we see WL_LATCH_SET.
+			 */
+			if (events[i].events & WL_LATCH_SET)
+				if (pending_pm_pmsignal)
+					process_pm_pmsignal();
 
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
-- 
2.47.0

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Munro (#6)
Re: connection establishment versus parallel workers

Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:

> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit. I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14. This bisects pretty
cleanly to commit d872510. I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

--
nathan

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#7)
Re: connection establishment versus parallel workers

On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:

> On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:

> > 0001 patch is unchanged, 0002 patch sketches out a response to the
> > observation a couple of paragraphs above.

> Both of these patches seem to improve matters quite a bit. I haven't yet
> thought too deeply about it all, but upon a skim, your patches seem
> entirely reasonable to me.

I gave these a closer look, and I still feel that they are both
straightforward and reasonable. IIUC the main open question is whether
this might cause problems for other PM signal kinds. Like you, I don't see
anything immediately obvious there, but I'll admit I'm not terribly
familiar with the precise characteristics of postmaster signals. In any
case, 0001 feels pretty safe to me.

> However, while this makes the test numbers for >= v16 look more like those
> for v15, we're also seeing a big jump from v13 to v14. This bisects pretty
> cleanly to commit d872510. I haven't figured out _why_ this commit is
> impacting this particular test, but I figured I'd at least update the
> thread with what we know so far.

I regrettably have no updates on this one, yet.

--
nathan

#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#8)
Re: connection establishment versus parallel workers

On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:

> > On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:

> > > 0001 patch is unchanged, 0002 patch sketches out a response to the
> > > observation a couple of paragraphs above.

> > Both of these patches seem to improve matters quite a bit. I haven't yet
> > thought too deeply about it all, but upon a skim, your patches seem
> > entirely reasonable to me.

> I gave these a closer look, and I still feel that they are both
> straightforward and reasonable. IIUC the main open question is whether
> this might cause problems for other PM signal kinds. Like you, I don't see
> anything immediately obvious there, but I'll admit I'm not terribly
> familiar with the precise characteristics of postmaster signals. In any
> case, 0001 feels pretty safe to me.

Cool. Thanks. I'll think about what else could be affected by that
change as you say, and if nothing jumps out I'll go ahead and commit
them, back to 16.

I have done a lot more study of this problem and was about to write in
with some more patches to propose for master only. Basically that
"100" is destroying performance in this workload, which at least on my
machine hardly gets any parallelism at all, and only in sporadic
bursts. You can argue that we aren't designed for high frequency
short-lived workers (we'll have to reuse workers in some way to be
good at that), but I don't think it has to fail as badly as it does
today. It falls off a cliff instead of plateauing: we are so busy
forking that we don't get around to reaping children, so all our slots
are (artificially) used up most of the time, and the queries that do
manage to nab one then sit on their hands for a long time at query
end. "1" gets much smoother results, but as prophesied in aa1351f1,
the complexity is terrible, possibly even O(n^3) in places depending
on how you count: there are many places that scan the whole worker
list, and one that even scans it again for each item, and that is for
each thing that starts. IOW we have to fix the complexity
fundamentally. I have a WIP patch that adds a couple of work queues,
so that the postmaster never has to consider anything more than the
head of a queue in various places. More soon...

> > However, while this makes the test numbers for >= v16 look more like those
> > for v15, we're also seeing a big jump from v13 to v14. This bisects pretty
> > cleanly to commit d872510. I haven't figured out _why_ this commit is
> > impacting this particular test, but I figured I'd at least update the
> > thread with what we know so far.

> I regrettably have no updates on this one, yet.

My first thought was that the catalogues needed for connection might
be getting evicted, but the data size seems too small for that surely
and you'd probably have picked it up immediately from wait events.
Weird.

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#9)
3 attachment(s)
Re: connection establishment versus parallel workers

On Tue, Jan 14, 2025 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> > I gave these a closer look, and I still feel that they are both
> > straightforward and reasonable. IIUC the main open question is whether
> > this might cause problems for other PM signal kinds. Like you, I don't see
> > anything immediately obvious there, but I'll admit I'm not terribly
> > familiar with the precise characteristics of postmaster signals. In any
> > case, 0001 feels pretty safe to me.

> Cool. Thanks. I'll think about what else could be affected by that
> change as you say, and if nothing jumps out I'll go ahead and commit
> them, back to 16.

I pushed 0001, addressing the main problem.

I think 0002 described and addressed a real phenomenon but only when
you have multiple sockets with non-empty listen queues. If we fixed
the real underlying problems it wouldn't be an issue. I decided to
unsee that for now.

> I have done a lot more study of this problem and was about to write in
> with some more patches to propose for master only. Basically that
> "100" is destroying performance in this workload, which at least on my
> machine hardly gets any parallelism at all, and only in sporadic
> bursts. You can argue that we aren't designed for high frequency
> short-lived workers (we'll have to reuse workers in some way to be
> good at that), but I don't think it has to fail as badly as it does
> today. It falls off a cliff instead of plateauing: we are so busy
> forking that we don't get around to reaping children, so all our slots
> are (artificially) used up most of the time, and the queries that do
> manage to nab one then sit on their hands for a long time at query
> end. "1" gets much smoother results, but as prophesied in aa1351f1,
> the complexity is terrible, possibly even O(n^3) in places depending
> on how you count: there are many places that scan the whole worker
> list, and one that even scans it again for each item, and that is for
> each thing that starts. IOW we have to fix the complexity
> fundamentally. I have a WIP patch that adds a couple of work queues,
> so that the postmaster never has to consider anything more than the
> head of a queue in various places. More soon...

Here's the WIP code I have come up with for that so far.

Remaining opportunities not attempted:
1. When a child exits, we could use a hash table to find it by pid.
2. When looking for a bgworker slot that is not in use, we could do
something better than linear search (a rough sketch follows below).
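
For the second of those, a simple free-slot stack would give O(1)
allocation (an untested sketch with made-up names, not part of the
attached patches):

	/* Track unused bgworker slot numbers on a stack. */
	typedef struct BackgroundWorkerFreeList
	{
		int			nfree;		/* number of unused slots */
		int			slots[FLEXIBLE_ARRAY_MEMBER];	/* max_worker_processes entries */
	} BackgroundWorkerFreeList;

	static int
	bgworker_alloc_slot(BackgroundWorkerFreeList *fl)
	{
		/* Pop an unused slot number, or -1 if every slot is in use. */
		return fl->nfree > 0 ? fl->slots[--fl->nfree] : -1;
	}

	static void
	bgworker_free_slot(BackgroundWorkerFreeList *fl, int slotno)
	{
		/* Push a recycled slot number; caller holds BackgroundWorkerLock. */
		fl->slots[fl->nfree++] = slotno;
	}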

Attachments:

0001-Remove-BackgroundWorkerStateChange-s-outer-loop.patch (text/x-patch; charset=US-ASCII)
From 3a323e9eda8df187961ba5efc2b0d63af9520cfe Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Dec 2024 16:37:38 +1300
Subject: [PATCH 1/3] Remove BackgroundWorkerStateChange()'s outer loop.

Previously, if a backend had modified a background worker slot and sent
PMSIGNAL_BACKGROUND_WORKER_CHANGE, then the postmaster would scan all
slots to find what changed.  Add a circular change queue.  The
requesting backend inserts the index of the changed slot into the queue,
and the postmaster consumes it from there directly.  The values are
range-checked so that the postmaster is no less robust against rogue
backends trashing memory.

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c | 203 +++++++++++++++++++++++-------
 1 file changed, 160 insertions(+), 43 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index b288915cec8..0de0422bb53 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -96,9 +96,26 @@ typedef struct BackgroundWorkerArray
 	int			total_slots;
 	uint32		parallel_register_count;
 	uint32		parallel_terminate_count;
+
+	/*
+	 * A circular queue of slot change notifications.  Backends write slot
+	 * numbers into this queue at the "head" position to tell the postmaster
+	 * that a slot is now being used, or should now be terminated.
+	 */
+	int		   *change_queue;
+	uint32		change_queue_tail;	/* postmaster advances (consumer) */
+	uint32		change_queue_head;	/* backends advance (producers) */
+
 	BackgroundWorkerSlot slot[FLEXIBLE_ARRAY_MEMBER];
 } BackgroundWorkerArray;
 
+/*
+ * Each slot can have at most two unconsumed events produced by a backend:
+ * when in_use is set, and when terminate is set.  We need one extra element
+ * because tail==head means empty.
+ */
+#define BGWORKER_CHANGE_QUEUE_SIZE (max_worker_processes * 2 + 1)
+
 struct BackgroundWorkerHandle
 {
 	int			slot;
@@ -151,6 +168,7 @@ BackgroundWorkerShmemSize(void)
 	size = offsetof(BackgroundWorkerArray, slot);
 	size = add_size(size, mul_size(max_worker_processes,
 								   sizeof(BackgroundWorkerSlot)));
+	size = add_size(size, mul_size(BGWORKER_CHANGE_QUEUE_SIZE, sizeof(int)));
 
 	return size;
 }
@@ -175,6 +193,15 @@ BackgroundWorkerShmemInit(void)
 		BackgroundWorkerData->parallel_register_count = 0;
 		BackgroundWorkerData->parallel_terminate_count = 0;
 
+		/*
+		 * Set up the change notification queue.  The queue itself is after
+		 * slots[] in memory.
+		 */
+		BackgroundWorkerData->change_queue = (int *)
+			&BackgroundWorkerData->slot[max_worker_processes];
+		BackgroundWorkerData->change_queue_head = 0;
+		BackgroundWorkerData->change_queue_tail = 0;
+
 		/*
 		 * Copy contents of worker list into shared memory.  Record the shared
 		 * memory slot assigned to each worker.  This ensures a 1-to-1
@@ -195,6 +222,10 @@ BackgroundWorkerShmemInit(void)
 			rw->rw_shmem_slot = slotno;
 			rw->rw_worker.bgw_notify_pid = 0;	/* might be reinit after crash */
 			memcpy(&slot->worker, &rw->rw_worker, sizeof(BackgroundWorker));
+
+			/* Enqueue change notifications for all populated slots. */
+			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
+
 			++slotno;
 		}
 
@@ -208,6 +239,7 @@ BackgroundWorkerShmemInit(void)
 			slot->in_use = false;
 			++slotno;
 		}
+
 	}
 	else
 		Assert(found);
@@ -235,7 +267,7 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 }
 
 /*
- * Notice changes to shared memory made by other backends.
+ * Process state change notifications enqueued by backends.
  * Accept new worker requests only if allow_new_workers is true.
  *
  * This code runs in the postmaster, so we must be very careful not to assume
@@ -245,41 +277,80 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 void
 BackgroundWorkerStateChange(bool allow_new_workers)
 {
-	int			slotno;
+	uint32		tail;
+	uint32		head;
+
+	Assert(!IsUnderPostmaster);
 
 	/*
-	 * The total number of slots stored in shared memory should match our
-	 * notion of max_worker_processes.  If it does not, something is very
-	 * wrong.  Further down, we always refer to this value as
-	 * max_worker_processes, in case shared memory gets corrupted while we're
-	 * looping.
+	 * The queue head might be advanced concurrently, but in that case we'll
+	 * receive another PMSIGNAL_BACKGROUND_WORKER_CHANGE notification.  So we
+	 * take one snapshot of the current values, and process only that range.
+	 * The signal handler is assumed to have been a memory barrier, so we see
+	 * the new head.  The tail is only written by the postmaster.
 	 */
-	if (max_worker_processes != BackgroundWorkerData->total_slots)
+	tail = BackgroundWorkerData->change_queue_tail;
+	head = BackgroundWorkerData->change_queue_head;
+
+	/* Sanity check queue range. */
+	if (tail >= BGWORKER_CHANGE_QUEUE_SIZE ||
+		head >= BGWORKER_CHANGE_QUEUE_SIZE)
 	{
+		/* This corruption isn't recoverable, so just squawk. */
 		ereport(LOG,
-				(errmsg("inconsistent background worker state (\"max_worker_processes\"=%d, total slots=%d)",
-						max_worker_processes,
-						BackgroundWorkerData->total_slots)));
+				(errmsg("background worker change queue range out of bounds (tail=%u, head=%u, size=%d)",
+						tail, head, BGWORKER_CHANGE_QUEUE_SIZE)));
 		return;
 	}
 
 	/*
-	 * Iterate through slots, looking for newly-registered workers or workers
-	 * who must die.
+	 * Make sure that we can load change_queue[] values and referenced slot[]
+	 * values contents that were stored before head was stored.  Pairs with
+	 * EnqueueBackgroundWorkerStateChange().
 	 */
-	for (slotno = 0; slotno < max_worker_processes; ++slotno)
+	pg_read_barrier();
+
+	/*
+	 * Consume from slot change queue, looking for newly-registered workers or
+	 * workers who must die.
+	 */
+	while (tail != head)
 	{
-		BackgroundWorkerSlot *slot = &BackgroundWorkerData->slot[slotno];
+		int			slotno = BackgroundWorkerData->change_queue[tail];
+		BackgroundWorkerSlot *slot;
 		RegisteredBgWorker *rw;
 
-		if (!slot->in_use)
+		/*
+		 * Advance tail.  It doesn't really need to be in shared memory
+		 * because we reserved space to make overflow impossible, but we put
+		 * it there so that EnqueueBackgroundWorkerStateChange() can make an
+		 * assertion check.
+		 */
+		if (++tail == BGWORKER_CHANGE_QUEUE_SIZE)
+			tail = 0;
+		BackgroundWorkerData->change_queue_tail = tail;
+
+		/* Sanity check slot index. */
+		if (slotno < 0 || slotno >= max_worker_processes)
+		{
+			ereport(LOG,
+					(errmsg("background worker change queue contains out of range slot number %d",
+							slotno)));
 			continue;
+		}
 
 		/*
-		 * Make sure we don't see the in_use flag before the updated slot
-		 * contents.
+		 * We only expect to receive change notifications for slots that are
+		 * in use.
 		 */
-		pg_read_barrier();
+		slot = &BackgroundWorkerData->slot[slotno];
+		if (!slot->in_use)
+		{
+			ereport(LOG,
+					(errmsg("received background worker change request for slot %d but it is not in use",
+							slotno)));
+			continue;
+		}
 
 		/* See whether we already know about this worker. */
 		rw = FindRegisteredWorkerBySlotNumber(slotno);
@@ -312,6 +383,21 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 		if (!allow_new_workers)
 			slot->terminate = true;
 
+		/*
+		 * Try to allocate space for the registration data in the registered
+		 * workers list, or log and treat as terminated immediately.
+		 */
+		rw = MemoryContextAllocExtended(PostmasterContext,
+										sizeof(RegisteredBgWorker),
+										MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
+		if (rw == NULL)
+		{
+			ereport(LOG,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+			slot->terminate = true;
+		}
+
 		/*
 		 * If the worker is marked for termination, we don't need to add it to
 		 * the registered workers list; we can just free the slot. However, if
@@ -339,21 +425,10 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 			if (notify_pid != 0)
 				kill(notify_pid, SIGUSR1);
 
-			continue;
-		}
+			if (rw)
+				pfree(rw);
 
-		/*
-		 * Copy the registration data into the registered workers list.
-		 */
-		rw = MemoryContextAllocExtended(PostmasterContext,
-										sizeof(RegisteredBgWorker),
-										MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
-		if (rw == NULL)
-		{
-			ereport(LOG,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-			return;
+			continue;
 		}
 
 		/*
@@ -415,6 +490,48 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 	}
 }
 
+/*
+ * Enqueue a change notification for a background worker slot.  The caller is
+ * responsible for sending PMSIGNAL_BACKGROUND_WORKER_CHANGE, and serializing
+ * insertions with BackgroundWorkerLock.  The memory barriers pair with
+ * BackgroundWorkerStateChange() in the postmaster, which consumes from the
+ * queue without taking the lock.
+ */
+static inline void
+EnqueueBackgroundWorkerStateChange(int slotno)
+{
+	uint32		head;
+
+	Assert(IsUnderPostmaster);
+	Assert(LWLockHeldByMe(BackgroundWorkerLock));
+
+	/* Make sure slot contents are stored before queue contents. */
+	pg_write_barrier();
+
+	/* Write the slot number at head position in the queue. */
+	head = BackgroundWorkerData->change_queue_head;
+	Assert(head < BGWORKER_CHANGE_QUEUE_SIZE);
+	BackgroundWorkerData->change_queue[head] = slotno;
+
+	/* Make sure the queue entry is stored before the new head. */
+	pg_write_barrier();
+	if (++head == BGWORKER_CHANGE_QUEUE_SIZE)
+		head = 0;
+	BackgroundWorkerData->change_queue_head = head;
+
+	/*
+	 * It is impossible for the head to crash into the tail, because we
+	 * reserved space for the in_use and terminate state changes, and no other
+	 * changes are possible until the slot is recycled by the postmaster. This
+	 * barrier pairs with the memory barrier issued by the postmaster before
+	 * it clears in_use.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	pg_read_barrier();
+	Assert(head != BackgroundWorkerData->change_queue_tail);
+#endif
+}
+
 /*
  * Forget about a background worker that's no longer needed.
  *
@@ -1104,14 +1221,8 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
 			generation = slot->generation;
 			if (parallel)
 				BackgroundWorkerData->parallel_register_count++;
-
-			/*
-			 * Make sure postmaster doesn't see the slot as in use before it
-			 * sees the new contents.
-			 */
-			pg_write_barrier();
-
 			slot->in_use = true;
+			EnqueueBackgroundWorkerStateChange(slotno);
 			success = true;
 			break;
 		}
@@ -1301,16 +1412,22 @@ TerminateBackgroundWorker(BackgroundWorkerHandle *handle)
 	Assert(handle->slot < max_worker_processes);
 	slot = &BackgroundWorkerData->slot[handle->slot];
 
-	/* Set terminate flag in shared memory, unless slot has been reused. */
+	/*
+	 * Set terminate flag in shared memory, unless slot has been reused, or a
+	 * backend has already set terminate.  The latter check avoids overflowing
+	 * the change queue in the unlikely event of many calls to terminate the
+	 * same worker (see BGWORKER_CHANGE_QUEUE_SIZE).
+	 */
 	LWLockAcquire(BackgroundWorkerLock, LW_EXCLUSIVE);
-	if (handle->generation == slot->generation)
+	if (handle->generation == slot->generation && !slot->terminate)
 	{
 		slot->terminate = true;
+		EnqueueBackgroundWorkerStateChange(handle->slot);
 		signal_postmaster = true;
 	}
 	LWLockRelease(BackgroundWorkerLock);
 
-	/* Make sure the postmaster notices the change to shared memory. */
+	/* Tell postmaster to process state changes */
 	if (signal_postmaster)
 		SendPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE);
 }
-- 
2.47.1

0002-Remove-BackgroundWorkerStateChange-s-inner-loop.patch (text/x-patch; charset=US-ASCII)
From 5b41df0db817970ca2278a6354cc0790cd565a0c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Dec 2024 17:44:54 +1300
Subject: [PATCH 2/3] Remove BackgroundWorkerStateChange()'s inner loop.

Previously, when processing a slot that might have changed, the
postmaster would scan the list of registered background workers looking
for the one using that slot.

Maintain a table in postmaster private memory that can be used to find
entries by slot number.

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c | 43 ++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 0de0422bb53..3e326e48540 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -39,6 +39,13 @@
  */
 dlist_head	BackgroundWorkerList = DLIST_STATIC_INIT(BackgroundWorkerList);
 
+/*
+ * An array of registered background workers indexed by slot number.
+ * BackgroundWorkerList entries are also in this table for fast lookup, and
+ * must be kept in sync.
+ */
+static RegisteredBgWorker **BackgroundWorkerTable;
+
 /*
  * BackgroundWorkerSlots exist in shared memory and can be accessed (via
  * the BackgroundWorkerArray) by both the postmaster and by regular backends.
@@ -202,6 +209,19 @@ BackgroundWorkerShmemInit(void)
 		BackgroundWorkerData->change_queue_head = 0;
 		BackgroundWorkerData->change_queue_tail = 0;
 
+		/*
+		 * Allocate an array to let us find RegisteredBgWorker objects by slot
+		 * number.
+		 *
+		 * XXX Where else could this allocation happen?
+		 */
+		if (PostmasterContext && !BackgroundWorkerTable)
+			BackgroundWorkerTable =
+				MemoryContextAllocExtended(PostmasterContext,
+										   sizeof(BackgroundWorkerTable[0]) *
+										   max_worker_processes,
+										   MCXT_ALLOC_ZERO);
+
 		/*
 		 * Copy contents of worker list into shared memory.  Record the shared
 		 * memory slot assigned to each worker.  This ensures a 1-to-1
@@ -223,6 +243,9 @@ BackgroundWorkerShmemInit(void)
 			rw->rw_worker.bgw_notify_pid = 0;	/* might be reinit after crash */
 			memcpy(&slot->worker, &rw->rw_worker, sizeof(BackgroundWorker));
 
+			/* Add to slot-number lookup table. */
+			BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
+
 			/* Enqueue change notifications for all populated slots. */
 			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
 
@@ -252,18 +275,11 @@ BackgroundWorkerShmemInit(void)
 static RegisteredBgWorker *
 FindRegisteredWorkerBySlotNumber(int slotno)
 {
-	dlist_iter	iter;
+	RegisteredBgWorker *rw = BackgroundWorkerTable[slotno];
 
-	dlist_foreach(iter, &BackgroundWorkerList)
-	{
-		RegisteredBgWorker *rw;
-
-		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-		if (rw->rw_shmem_slot == slotno)
-			return rw;
-	}
+	Assert(!rw || rw->rw_shmem_slot == slotno);
 
-	return NULL;
+	return rw;
 }
 
 /*
@@ -487,6 +503,9 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 								 rw->rw_worker.bgw_name)));
 
 		dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
+
+		/* Add to slot-number lookup table. */
+		BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
 	}
 }
 
@@ -551,6 +570,10 @@ ForgetBackgroundWorker(RegisteredBgWorker *rw)
 	slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
 	Assert(slot->in_use);
 
+	/* Remove from slot-number lookup table. */
+	Assert(BackgroundWorkerTable[rw->rw_shmem_slot] == rw);
+	BackgroundWorkerTable[rw->rw_shmem_slot] = NULL;
+
 	/*
 	 * We need a memory barrier here to make sure that the update of
 	 * parallel_terminate_count completes before the store to in_use.
-- 
2.47.1

0003-Remove-loops-over-BackgroundWorkerList.patch (text/x-patch; charset=US-ASCII)
From 40f77c2e714aa6adf39952cb9fecfcc19330c7b8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 9 Jan 2025 10:37:05 +1300
Subject: [PATCH 3/3] Remove loops over BackgroundWorkerList.

In maybe_start_bgworkers(), the postmaster would loop over
BackgroundWorkerList, possibly starting up to 100 workers.  In
DetermineSleepTime(), it would loop again to compute a rarely needed
nap time.

The reason for starting so many workers in each server loop was to try
to mitigate the O(n^2) or worse logic caused by the data structures used
to track workers (see commit aa1351f1).  Unfortunately it also delays
processing of new socket connections and child exit events, and also
creates artificial shortages of backend and worker slots.

Address the book-keeping complexity directly, by introducing three new
queues organized in processing order.  Non-running workers sit in one of
them, according to their state:

  (1) ready to start now,
  (2) waiting for a postmaster state change, or
  (3) waiting for a future time to be restarted.

Now the postmaster's main loop only has to handle the head item in the
start and time delay queues, or, more commonly, see that they are empty.

XXX Work in progress!

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c           |  45 ++-
 src/backend/postmaster/postmaster.c         | 343 +++++++++-----------
 src/include/postmaster/bgworker_internals.h |   8 +-
 3 files changed, 198 insertions(+), 198 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 3e326e48540..b06db0fb152 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -39,6 +39,19 @@
  */
 dlist_head	BackgroundWorkerList = DLIST_STATIC_INIT(BackgroundWorkerList);
 
+/*
+ * A background worker that is not running is also in one of these queues
+ * waiting to start, using rw_queue_node.  These queues represent workers that
+ * should be started immediately, after a future postmaster state is reached,
+ * or after a specific time in the future, respectively.
+ */
+dlist_head	BackgroundWorkerStartQueue =
+DLIST_STATIC_INIT(BackgroundWorkerStartQueue);
+dlist_head	BackgroundWorkerWaitStateQueue =
+DLIST_STATIC_INIT(BackgroundWorkerWaitStateQueue);
+dlist_head	BackgroundWorkerWaitTimeQueue =
+DLIST_STATIC_INIT(BackgroundWorkerWaitTimeQueue);
+
 /*
  * An array of registered background workers indexed by slot number.
  * BackgroundWorkerList entries are also in this table for fast lookup, and
@@ -246,8 +259,11 @@ BackgroundWorkerShmemInit(void)
 			/* Add to slot-number lookup table. */
 			BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
 
-			/* Enqueue change notifications for all populated slots. */
-			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
+			/*
+			 * Schedule to start.  It'll be requeued if it wants to wait for a
+			 * different state.
+			 */
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
 
 			++slotno;
 		}
@@ -493,7 +509,6 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 
 		/* Initialize postmaster bookkeeping. */
 		rw->rw_pid = 0;
-		rw->rw_crashed_at = 0;
 		rw->rw_shmem_slot = slotno;
 		rw->rw_terminate = false;
 
@@ -502,10 +517,14 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 				(errmsg_internal("registering background worker \"%s\"",
 								 rw->rw_worker.bgw_name)));
 
+		/* Put it on the list of all workers. */
 		dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
 
 		/* Add to slot-number lookup table. */
 		BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
+
+		/* Add to the start queue. */
+		dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
 	}
 }
 
@@ -588,6 +607,8 @@ ForgetBackgroundWorker(RegisteredBgWorker *rw)
 			(errmsg_internal("unregistering background worker \"%s\"",
 							 rw->rw_worker.bgw_name)));
 
+	if (!dlist_node_is_detached(&rw->rw_queue_node))
+		dlist_delete_thoroughly(&rw->rw_queue_node);
 	dlist_delete(&rw->rw_lnode);
 	pfree(rw);
 }
@@ -726,6 +747,9 @@ ResetBackgroundWorkerCrashTimes(void)
 
 		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
 
+		/* We waited for them all to exit, so they should have no pid. */
+		Assert(rw->rw_pid == 0);
+
 		if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
 		{
 			/*
@@ -748,18 +772,20 @@ ResetBackgroundWorkerCrashTimes(void)
 			 */
 			Assert((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0);
 
-			/*
-			 * Allow this worker to be restarted immediately after we finish
-			 * resetting.
-			 */
-			rw->rw_crashed_at = 0;
-
 			/*
 			 * If there was anyone waiting for it, they're history.
 			 */
 			rw->rw_worker.bgw_notify_pid = 0;
 		}
 	}
+
+	/* Remove everything from queues. */
+	while (!dlist_is_empty(&BackgroundWorkerStartQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerStartQueue));
+	while (!dlist_is_empty(&BackgroundWorkerWaitStateQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerWaitStateQueue));
+	while (!dlist_is_empty(&BackgroundWorkerWaitTimeQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerWaitTimeQueue));
 }
 
 /*
@@ -1165,7 +1191,6 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 
 	rw->rw_worker = *worker;
 	rw->rw_pid = 0;
-	rw->rw_crashed_at = 0;
 	rw->rw_terminate = false;
 
 	dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..1041c4b3686 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -371,10 +371,6 @@ static bool avlauncher_needs_signal = false;
 /* received START_WALRECEIVER signal */
 static bool WalReceiverRequested = false;
 
-/* set when there's a worker that needs to be started up */
-static bool StartWorkerNeeded = true;
-static bool HaveCrashedWorker = false;
-
 /* set when signals arrive */
 static volatile sig_atomic_t pending_pm_pmsignal;
 static volatile sig_atomic_t pending_pm_child_exit;
@@ -429,13 +425,16 @@ static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
-static void maybe_start_bgworkers(void);
+static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
 static void StartAutovacuumWorker(void);
 static bool StartBackgroundWorker(RegisteredBgWorker *rw);
 static void InitPostmasterDeathWatchHandle(void);
+static void bgworker_requeue_on_state_change(void);
+static int	bgworker_requeue_on_time(void);
+static void bgworker_schedule_restart(RegisteredBgWorker *rw);
 
 #ifdef WIN32
 #define WNOHANG 0				/* ignored, so any integer value will do */
@@ -1371,9 +1370,6 @@ PostmasterMain(int argc, char *argv[])
 	StartupStatus = STARTUP_RUNNING;
 	UpdatePMState(PM_STARTUP);
 
-	/* Some workers may be scheduled to start now */
-	maybe_start_bgworkers();
-
 	status = ServerLoop();
 
 	/*
@@ -1520,14 +1516,10 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	int			sleep_ms;
 
-	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
-	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	/* If we're in a shutdown sequence, we ignore bgworkers altogether. */
+	if (Shutdown > NoShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1543,54 +1535,18 @@ DetermineSleepTime(void)
 			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
-		return 0;
-
-	if (HaveCrashedWorker)
-	{
-		dlist_mutable_iter iter;
-
-		/*
-		 * When there are crashed bgworkers, we sleep just long enough that
-		 * they are restarted when they request to be.  Scan the list to
-		 * determine the minimum of all wakeup times according to most recent
-		 * crash time and requested restart interval.
-		 */
-		dlist_foreach_modify(iter, &BackgroundWorkerList)
-		{
-			RegisteredBgWorker *rw;
-			TimestampTz this_wakeup;
-
-			rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-
-			if (rw->rw_crashed_at == 0)
-				continue;
-
-			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
-				|| rw->rw_terminate)
-			{
-				ForgetBackgroundWorker(rw);
-				continue;
-			}
-
-			this_wakeup = TimestampTzPlusMilliseconds(rw->rw_crashed_at,
-													  1000L * rw->rw_worker.bgw_restart_time);
-			if (next_wakeup == 0 || this_wakeup < next_wakeup)
-				next_wakeup = this_wakeup;
-		}
-	}
-
-	if (next_wakeup != 0)
-	{
-		int			ms;
+	/*
+	 * Check if any deferred workers should be moved to the start queue.  This
+	 * also tells us how long until the next one is ready, or INT_MAX if there
+	 * are none.
+	 */
+	sleep_ms = bgworker_requeue_on_time();
 
-		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
-		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-												   next_wakeup);
-		return Min(60 * 1000, ms);
-	}
+	/* Don't sleep at all if there is a worker ready to be started. */
+	if (!dlist_is_empty(&BackgroundWorkerStartQueue))
+		return 0;
 
-	return 60 * 1000;
+	return Min(60 * 1000, sleep_ms);
 }
 
 /*
@@ -2313,13 +2269,6 @@ process_pm_child_exit(void)
 			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
-			/*
-			 * At the next iteration of the postmaster's main loop, we will
-			 * crank up the background tasks like the autovacuum launcher and
-			 * background workers that were not started earlier already.
-			 */
-			StartWorkerNeeded = true;
-
 			/* at this point we are really open for business */
 			ereport(LOG,
 					(errmsg("database system is ready to accept connections")));
@@ -2615,6 +2564,8 @@ CleanupBackend(PMChild *bp,
 
 	if (crashed)
 	{
+		if (rw)
+			rw->rw_pid = 0;
 		HandleChildCrash(bp_pid, exitstatus, procname);
 		return;
 	}
@@ -2635,19 +2586,18 @@ CleanupBackend(PMChild *bp,
 	 */
 	if (bp_bkend_type == B_BG_WORKER)
 	{
-		if (!EXIT_STATUS_0(exitstatus))
-		{
-			/* Record timestamp, so we know when to restart the worker. */
-			rw->rw_crashed_at = GetCurrentTimestamp();
-		}
-		else
-		{
-			/* Zero exit status means terminate */
-			rw->rw_crashed_at = 0;
+		rw->rw_pid = 0;
+
+		/*
+		 * Zero exit status means terminate.  Otherwise we schedule a delayed
+		 * restart.  Note that ReportBackgroundWorkerExit() will free the
+		 * worker if appropriate.
+		 */
+		if (EXIT_STATUS_0(exitstatus))
 			rw->rw_terminate = true;
-		}
+		else
+			bgworker_schedule_restart(rw);
 
-		rw->rw_pid = 0;
 		ReportBackgroundWorkerExit(rw); /* report child death */
 
 		if (!logged)
@@ -2656,9 +2606,6 @@ CleanupBackend(PMChild *bp,
 						 procname, bp_pid, exitstatus);
 			logged = true;
 		}
-
-		/* have it be restarted */
-		HaveCrashedWorker = true;
 	}
 
 	if (!logged)
@@ -3144,8 +3091,7 @@ pmstate_name(PMState state)
 }
 
 /*
- * Simple wrapper for updating pmState. The main reason to have this wrapper
- * is that it makes it easy to log all state transitions.
+ * Update pmState.
  */
 static void
 UpdatePMState(PMState newState)
@@ -3153,6 +3099,9 @@ UpdatePMState(PMState newState)
 	elog(DEBUG1, "updating PMState from %s to %s",
 		 pmstate_name(pmState), pmstate_name(newState));
 	pmState = newState;
+
+	/* Some bgworkers might need to move to the start queue. */
+	bgworker_requeue_on_state_change();
 }
 
 /*
@@ -3265,8 +3214,7 @@ LaunchMissingBackgroundProcesses(void)
 		WalSummarizerPMChild = StartChildProcess(B_WAL_SUMMARIZER);
 
 	/* Get other worker processes running, if needed */
-	if (StartWorkerNeeded || HaveCrashedWorker)
-		maybe_start_bgworkers();
+	maybe_start_bgworker();
 }
 
 /*
@@ -3621,9 +3569,6 @@ process_pm_pmsignal(void)
 
 		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
-
-		/* Some workers may be scheduled to start now */
-		StartWorkerNeeded = true;
 	}
 
 	/* Process background worker state changes. */
@@ -3631,7 +3576,6 @@ process_pm_pmsignal(void)
 	{
 		/* Accept new worker requests only if not stopping. */
 		BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS);
-		StartWorkerNeeded = true;
 	}
 
 	/* Tell syslogger to rotate logfile if requested */
@@ -3954,7 +3898,8 @@ StartBackgroundWorker(RegisteredBgWorker *rw)
 		ereport(LOG,
 				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
 				 errmsg("no slot available for new background worker process")));
-		rw->rw_crashed_at = GetCurrentTimestamp();
+		bgworker_schedule_restart(rw);
+		ReportBackgroundWorkerExit(rw);
 		return false;
 	}
 	bn->rw = rw;
@@ -3975,8 +3920,8 @@ StartBackgroundWorker(RegisteredBgWorker *rw)
 		/* undo what AssignPostmasterChildSlot did */
 		ReleasePostmasterChildSlot(bn);
 
-		/* mark entry as crashed, so we'll try again later */
-		rw->rw_crashed_at = GetCurrentTimestamp();
+		bgworker_schedule_restart(rw);
+		ReportBackgroundWorkerExit(rw);
 		return false;
 	}
 
@@ -4026,48 +3971,128 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 }
 
 /*
- * If the time is right, start background worker(s).
- *
- * As a side effect, the bgworker control variables are set or reset
- * depending on whether more workers may need to be started.
- *
- * We limit the number of workers started per call, to avoid consuming the
- * postmaster's attention for too long when many such requests are pending.
- * As long as StartWorkerNeeded is true, ServerLoop will not block and will
- * call this function again after dealing with any other issues.
+ * Schedule deferred workers to start if the state they are waiting for has
+ * arrived.  This should be called whenever pmState moves to states for which
+ * bgworker_should_start_now() can return true.
  */
 static void
-maybe_start_bgworkers(void)
+bgworker_requeue_on_state_change(void)
 {
-#define MAX_BGWORKERS_TO_LAUNCH 100
-	int			num_launched = 0;
-	TimestampTz now = 0;
 	dlist_mutable_iter iter;
 
+	dlist_foreach_modify(iter, &BackgroundWorkerWaitStateQueue)
+	{
+		RegisteredBgWorker *rw;
+
+		rw = dlist_container(RegisteredBgWorker, rw_queue_node, iter.cur);
+		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		{
+			dlist_delete(&rw->rw_queue_node);
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
+		}
+	}
+}
+
+/*
+ * Schedule workers to start if their start time has arrived.  Returns the
+ * number of milliseconds until the next time-delayed worker wants to start,
+ * or INT_MAX if there are none.
+ */
+static int
+bgworker_requeue_on_time(void)
+{
+	dlist_head *queue = &BackgroundWorkerWaitTimeQueue;
+
+	if (!dlist_is_empty(queue))
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		do
+		{
+			RegisteredBgWorker *rw;
+
+			/* If the head item in the queue needs more time, we are done. */
+			rw = dlist_head_element(RegisteredBgWorker, rw_queue_node, queue);
+			if (rw->rw_restart_at > now)
+				return TimestampDifferenceMilliseconds(now, rw->rw_restart_at);
+
+			/* Move this one to the start queue, and go around again. */
+			dlist_delete(&rw->rw_queue_node);
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
+		} while (!dlist_is_empty(queue));
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * After a background worker exits unsuccessfully or fails to start, this
+ * function can be used to schedule a restart attempt after "bgw_restart_time"
+ * seconds.
+ */
+static void
+bgworker_schedule_restart(RegisteredBgWorker *rw)
+{
+	dlist_head *queue = &BackgroundWorkerWaitTimeQueue;
+	RegisteredBgWorker *other = NULL;
+
+	if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
+		return;
+
+	/* Compute restart time. */
+	rw->rw_restart_at = TimestampTzPlusSeconds(GetCurrentTimestamp(),
+											   rw->rw_worker.bgw_restart_time);
+
 	/*
-	 * During crash recovery, we have no need to be called until the state
-	 * transition out of recovery.
+	 * Find insertion point in the time-order queue.  Start the search at the
+	 * tail end, since it has the highest time so far and there is a good
+	 * chance we have a higher time.
+	 *
+	 * XXX We could use a binary heap instead, but this should be very
+	 * infrequent and the list should usually be empty, so it doesn't really
+	 * matter.  The more important thing is that bgworker_requeue_on_time() is
+	 * O(1), since it is called frequently.
 	 */
-	if (FatalError)
+	if (!dlist_is_empty(queue))
 	{
-		StartWorkerNeeded = false;
-		HaveCrashedWorker = false;
-		return;
+		other = dlist_tail_element(RegisteredBgWorker,
+								   rw_queue_node,
+								   queue);
+		while (other && other->rw_restart_at > rw->rw_restart_at)
+		{
+			if (!dlist_has_prev(queue, &other->rw_queue_node))
+				other = NULL;
+			else
+				other =
+					dlist_container(RegisteredBgWorker,
+									rw_queue_node,
+									dlist_prev_node(queue,
+													&other->rw_queue_node));
+		}
 	}
 
-	/* Don't need to be called again unless we find a reason for it below */
-	StartWorkerNeeded = false;
-	HaveCrashedWorker = false;
+	if (other)
+		dlist_insert_after(&other->rw_queue_node, &rw->rw_queue_node);
+	else
+		dlist_push_head(queue, &rw->rw_queue_node);
+}
 
-	dlist_foreach_modify(iter, &BackgroundWorkerList)
+/*
+ * Start at most one background worker from BackgroundWorkerStartQueue.
+ */
+static void
+maybe_start_bgworker(void)
+{
+	while (!dlist_is_empty(&BackgroundWorkerStartQueue))
 	{
 		RegisteredBgWorker *rw;
 
-		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+		/* pop thoroughly (so we can tell we're not on any list) */
+		rw = dlist_head_element(RegisteredBgWorker, rw_queue_node,
+								&BackgroundWorkerStartQueue);
+		dlist_delete_thoroughly(&rw->rw_queue_node);
 
-		/* ignore if already running */
-		if (rw->rw_pid != 0)
-			continue;
+		Assert(rw->rw_pid == 0);
 
 		/* if marked for death, clean up and remove from list */
 		if (rw->rw_terminate)
@@ -4077,76 +4102,22 @@ maybe_start_bgworkers(void)
 		}
 
 		/*
-		 * If this worker has crashed previously, maybe it needs to be
-		 * restarted (unless on registration it specified it doesn't want to
-		 * be restarted at all).  Check how long ago did a crash last happen.
-		 * If the last crash is too recent, don't start it right away; let it
-		 * be restarted once enough time has passed.
+		 * If this worker is waiting for a different postmaster state, then
+		 * requeue it.
 		 */
-		if (rw->rw_crashed_at != 0)
+		if (!bgworker_should_start_now(rw->rw_worker.bgw_start_time))
 		{
-			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
-			{
-				int			notify_pid;
-
-				notify_pid = rw->rw_worker.bgw_notify_pid;
-
-				ForgetBackgroundWorker(rw);
-
-				/* Report worker is gone now. */
-				if (notify_pid != 0)
-					kill(notify_pid, SIGUSR1);
-
-				continue;
-			}
-
-			/* read system time only when needed */
-			if (now == 0)
-				now = GetCurrentTimestamp();
-
-			if (!TimestampDifferenceExceeds(rw->rw_crashed_at, now,
-											rw->rw_worker.bgw_restart_time * 1000))
-			{
-				/* Set flag to remember that we have workers to start later */
-				HaveCrashedWorker = true;
-				continue;
-			}
+			dlist_push_tail(&BackgroundWorkerWaitStateQueue,
+							&rw->rw_queue_node);
+			continue;
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
-		{
-			/* reset crash time before trying to start worker */
-			rw->rw_crashed_at = 0;
-
-			/*
-			 * Try to start the worker.
-			 *
-			 * On failure, give up processing workers for now, but set
-			 * StartWorkerNeeded so we'll come back here on the next iteration
-			 * of ServerLoop to try again.  (We don't want to wait, because
-			 * there might be additional ready-to-run workers.)  We could set
-			 * HaveCrashedWorker as well, since this worker is now marked
-			 * crashed, but there's no need because the next run of this
-			 * function will do that.
-			 */
-			if (!StartBackgroundWorker(rw))
-			{
-				StartWorkerNeeded = true;
-				return;
-			}
-
-			/*
-			 * If we've launched as many workers as allowed, quit, but have
-			 * ServerLoop call us again to look for additional ready-to-run
-			 * workers.  There might not be any, but we'll find out the next
-			 * time we run.
-			 */
-			if (++num_launched >= MAX_BGWORKERS_TO_LAUNCH)
-			{
-				StartWorkerNeeded = true;
-				return;
-			}
-		}
+		/*
+		 * Only try to start one worker, and then return control so that other
+		 * kinds of postmaster work can be processed.
+		 */
+		StartBackgroundWorker(rw);
+		break;
 	}
 }
 
diff --git a/src/include/postmaster/bgworker_internals.h b/src/include/postmaster/bgworker_internals.h
index 092b1610663..c58daff10de 100644
--- a/src/include/postmaster/bgworker_internals.h
+++ b/src/include/postmaster/bgworker_internals.h
@@ -33,13 +33,17 @@ typedef struct RegisteredBgWorker
 {
 	BackgroundWorker rw_worker; /* its registry entry */
 	pid_t		rw_pid;			/* 0 if not running */
-	TimestampTz rw_crashed_at;	/* if not 0, time it last crashed */
 	int			rw_shmem_slot;
 	bool		rw_terminate;
-	dlist_node	rw_lnode;		/* list link */
+	dlist_node	rw_lnode;		/* node for list of all workers */
+	dlist_node	rw_queue_node;	/* node for start/wait queues */
+	TimestampTz rw_restart_at;	/* deferred start time after failure */
 } RegisteredBgWorker;
 
 extern PGDLLIMPORT dlist_head BackgroundWorkerList;
+extern PGDLLIMPORT dlist_head BackgroundWorkerStartQueue;
+extern PGDLLIMPORT dlist_head BackgroundWorkerWaitStateQueue;
+extern PGDLLIMPORT dlist_head BackgroundWorkerWaitTimeQueue;
 
 extern Size BackgroundWorkerShmemSize(void);
 extern void BackgroundWorkerShmemInit(void);
-- 
2.47.1

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: connection establishment versus parallel workers

On Mon, Jan 20, 2025 at 6:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's the WIP code I have come up with for that so far.

Remaining opportunities not attempted:
1. When a child exits, we could use a hash table to find it by pid.
2. When looking for a bgworker slot that is not in use, we could do
something better than linear search.

I haven't had time to work on this again due to other projects, but I
wanted to write down the ideas I thought about for the record.
Obviously we'd want some kind of free list, but the postmaster would
need to push free slots into it when workers exit, and it can't use
locks or any data structures that can cause it to get stuck just
because a backend has corrupted shared memory. It must always be able
to process shutdown commands and coordinate crash restarts, no matter
how bananas the backends go. With that in mind:

Idea #1: A CAS-based linked list, relying on CAS never being emulated
with locks, and probably requiring 16 bit indexes since we can only
expect 32 bit atomic hardware and you might need to change both the
head and tail of a hypothetical list head when going from empty to one
element. But you have to convince yourself that it's OK to run a
CAS-loop in the postmaster, which might in theory be prevented
from completing...
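
Just to show the shape I mean (untested, all names invented): pack the
16 bit head and tail into a single 32 bit word, so that the empty ->
one element transition is one CAS, with per-slot "next" links in a
plain array.  The XXX comments mark exactly the parts that worry me:

#define NO_SLOT 0xFFFF

/* head in high 16 bits, tail in low 16 bits; both NO_SLOT when empty */
static void
freelist_push(pg_atomic_uint32 *head_tail, uint16 *next, uint16 slot)
{
    uint32      old = pg_atomic_read_u32(head_tail);

    next[slot] = NO_SLOT;
    for (;;)
    {
        uint16      head = old >> 16;
        uint32      newval;

        if (head == NO_SLOT)
            newval = ((uint32) slot << 16) | slot;  /* head = tail = slot */
        else
        {
            next[old & 0xFFFF] = slot;  /* XXX racy vs concurrent pops */
            newval = ((uint32) head << 16) | slot;
        }

        /*
         * XXX on failure "old" is refreshed and we go around again -- in
         * the postmaster, potentially without bound.
         */
        if (pg_atomic_compare_exchange_u32(head_tail, &old, newval))
            break;
    }
}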

Idea #2: The free list could be a simple circular buffer of slot
index numbers. That would be symmetrical with
0001-Remove-BackgroundWorkerStateChange-s-outer-loop.patch's shared
memory "start" queue, and could be coded essentially the same way.
The "start" queue and the "free" queue would then both be simple
arrays with a head and a tail, and in both cases only the consumers
(regular backends) need an lwlock to serialise against each other,
while the producer (the postmaster) can get away with careful memory
barriers and just has to range-check the head/tail indexes to deal
with untrusted shared memory contents. A rogue backend can jam up the
bgworker subsystem, but that's already true. It still can't prevent
shutdown or crash restart.
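
To make that concrete, here is roughly the shape I have in mind, minus
the range checks against untrusted head/tail values (all names
invented; size assumed to be a power of two, at least the number of
slots, so the producer can never overflow the ring):

typedef struct FreeQueue
{
    uint32      size;           /* power of two >= number of slots */
    pg_atomic_uint32 head;      /* consumer position (backends) */
    pg_atomic_uint32 tail;      /* producer position (postmaster) */
    uint32      items[FLEXIBLE_ARRAY_MEMBER];
} FreeQueue;

/* Postmaster side: no locks, can't get stuck. */
static void
free_queue_push(FreeQueue *q, uint32 slot)
{
    uint32      tail = pg_atomic_read_u32(&q->tail);

    q->items[tail % q->size] = slot;
    pg_write_barrier();         /* publish item before advancing tail */
    pg_atomic_write_u32(&q->tail, tail + 1);
}

/* Backend side: caller holds a (hypothetical) FreeQueueLock. */
static bool
free_queue_pop(FreeQueue *q, uint32 *slot)
{
    uint32      head = pg_atomic_read_u32(&q->head);

    if (head == pg_atomic_read_u32(&q->tail))
        return false;           /* empty */
    pg_read_barrier();          /* read item only after seeing new tail */
    *slot = q->items[head % q->size];
    pg_atomic_write_u32(&q->head, head + 1);
    return true;
}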

Idea #3: Suggested by Robert in an off-list chat about all this:
backends could maintain a shared memory free-list using existing dlist
technology protected by an lwlock. When it's empty *they* (not the
postmaster) would fill it up again using a linear slot search. It's
simple but not quite as satisfying, because it is still possible to
degrade to high frequency linear searches that only find a small
number of free slots each time if you're unlucky, ie you can entirely
fail to amortise.
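
A sketch, assuming we added a free_node to BackgroundWorkerSlot, a
free_list to BackgroundWorkerArray and a new lwlock to cover it (all
invented names):

static int
bgworker_alloc_slot(BackgroundWorkerArray *shm)
{
    BackgroundWorkerSlot *slot = NULL;

    LWLockAcquire(BackgroundWorkerFreeListLock, LW_EXCLUSIVE);

    if (dlist_is_empty(&shm->free_list))
    {
        /*
         * Refill with a linear scan.  If most slots are in use, this
         * may run often and find little: the failure-to-amortise case.
         */
        for (int i = 0; i < max_worker_processes; i++)
        {
            if (!shm->slot[i].in_use)
                dlist_push_tail(&shm->free_list, &shm->slot[i].free_node);
        }
    }

    if (!dlist_is_empty(&shm->free_list))
    {
        slot = dlist_container(BackgroundWorkerSlot, free_node,
                               dlist_pop_head_node(&shm->free_list));
        slot->in_use = true;
    }

    LWLockRelease(BackgroundWorkerFreeListLock);

    return slot ? (int) (slot - shm->slot) : -1;
}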

Idea #4: Maintain a bitmap of free slot indexes, which the
postmaster sets with atomic_fetch_or().
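
That might look something like this (names invented): the postmaster's
side is a single wait-free instruction, and backends claim a bit with
fetch-and:

/* Postmaster side: mark a slot free.  One atomic OR, can't get stuck. */
static void
bgworker_mark_slot_free(pg_atomic_uint32 *bitmap, int slot)
{
    pg_atomic_fetch_or_u32(&bitmap[slot / 32], 1u << (slot % 32));
}

/* Backend side: find and claim a free slot, or return -1. */
static int
bgworker_claim_slot(pg_atomic_uint32 *bitmap, int nwords)
{
    for (int w = 0; w < nwords; w++)
    {
        uint32      word = pg_atomic_read_u32(&bitmap[w]);

        while (word != 0)
        {
            int         bit = pg_rightmost_one_pos32(word);
            uint32      mask = 1u << bit;
            uint32      old = pg_atomic_fetch_and_u32(&bitmap[w], ~mask);

            if (old & mask)
                return w * 32 + bit;    /* claimed it */
            word = old & ~mask;         /* lost the race; try other bits */
        }
    }
    return -1;
}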

I lean towards idea #2, but haven't actually tried it.

A similar situation exists for DSM slot management, though that has
different complications: no postmaster interaction, but funky handle
requirements due to portability concerns. I think the handles could
probably be changed to encode the slot index + generation for O(1)
lookup at "attach" time, and free slots could be stored in a circular
queue for O(1) slot allocation. I think the use of random numbers
stemmed from SysV shared memory's need to find a free key in a 32 bit
OS-wide namespace (yuck). I haven't looked at that code in a while
but I don't recall any reason why even those couldn't be hidden inside
the slot itself, instead of being exposed in the handle, forcing
linear searches. A generation scheme would also be more robust
against weird random number collisions, and could detect in a more
obvious way handles that were valid but no longer are, instead of "I
looked everywhere and I couldn't find it".
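
For illustration, assuming dsm_handle stays 32 bits (the bit split is
arbitrary, and "generation" would be a new counter stored in each
control segment item):

#define DSM_SLOT_BITS   12      /* supports up to 4096 slots */
#define DSM_SLOT_MASK   ((1u << DSM_SLOT_BITS) - 1)

static inline dsm_handle
make_dsm_handle(uint32 slot, uint32 generation)
{
    return (dsm_handle) ((generation << DSM_SLOT_BITS) | slot);
}

/*
 * At attach time, lookup becomes O(1) instead of a linear search:
 *
 *     uint32  slot = handle & DSM_SLOT_MASK;
 *
 *     if (slot >= dsm_control->maxitems ||
 *         dsm_control->item[slot].generation != handle >> DSM_SLOT_BITS)
 *         elog(ERROR, "dynamic shared memory segment is gone");
 */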