Refactoring postmaster's code to cleanup after child exit

Started by Heikki Linnakangasalmost 2 years ago48 messageshackers

heikki.linnakangas@enterprisedb.com

almost 2 years ago

Reading through postmaster code, I spotted some refactoring
opportunities to make it slightly more readable.

Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList to see if it was a bgworker process. If not
found, it scans through the BackendList to see if it was a backend
process (which it really should be then). That feels a bit silly,
because every running background worker process also has an entry in
BackendList. There's a lot of duplication between
CleanupBackgroundWorker and CleanupBackend.

Before commit 8a02b3d732, we used to created Backend entries only for
background worker processes that connected to a database, not for other
background worker processes. I think that's why we have the code
structure we have. But now that we have a Backend entry for all bgworker
processes, it's more natural to have single function to deal with both
regular backends and bgworkers.

So I came up with the attached patches. This doesn't make any meaningful
user-visible changes, except for some incidental changes in log messages
(see commit message for details).

--
Heikki Linnakangas
Neon (https://neon.tech)

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 2 years ago

In reply to: Heikki Linnakangas (#1)

Re: Refactoring postmaster's code to cleanup after child exit

On 06/07/2024 22:01, Heikki Linnakangas wrote:

Reading through postmaster code, I spotted some refactoring
opportunities to make it slightly more readable.

Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList to see if it was a bgworker process. If not
found, it scans through the BackendList to see if it was a backend
process (which it really should be then). That feels a bit silly,
because every running background worker process also has an entry in
BackendList. There's a lot of duplication between
CleanupBackgroundWorker and CleanupBackend.

Before commit 8a02b3d732, we used to created Backend entries only for
background worker processes that connected to a database, not for other
background worker processes. I think that's why we have the code
structure we have. But now that we have a Backend entry for all bgworker
processes, it's more natural to have single function to deal with both
regular backends and bgworkers.

So I came up with the attached patches. This doesn't make any meaningful
user-visible changes, except for some incidental changes in log messages
(see commit message for details).

New patch version attached. Fixed conflicts with recent commits, no
other changes.

--
Heikki Linnakangas
Neon (https://neon.tech)

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 2 years ago

In reply to: Heikki Linnakangas (#2)

Re: Refactoring postmaster's code to cleanup after child exit

I committed the first two trivial patches, and have continued to work on
postmaster.c, and how it manages all the child processes.

This is a lot of patches. They're built on top of each other, because
that's the order I developed them in, but they probably could be applied
in different order too. Please help me by reviewing these, before the
stack grows even larger :-). Even partial reviews would be very helpful.
I suggest to start reading them in order, and when you get tired, just
send any comments you have up to that point.

* v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

This is the same refactoring patch I started this thread with.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Little refactoring of how postmaster launches the background processes.

* v3-0005-Add-test-for-connection-limits.patch
* v3-0006-Add-test-for-dead-end-backends.patch

A few new TAP tests for dead-end backends and enforcing connection
limits. We didn't have much coverage for these before.

* v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
* v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch

Some preliminary refactoring towards patch
v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

* v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch

I noticed that we never send SIGTERM or SIGQUIT to dead-end backends,
which seems silly. If the server is shutting down, dead-end backends
might prevent the shutdown from completing. Dead-end backends will
expire after authentication_timoeut (default 60s), so it won't last for
too long, but still seems like we should kill dead-end backends if
they're the only children preventing shutdown from completing.

* 3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

This is what I consider the main patch in this series. Currently, only
regular backens, bgworkers and autovacuum workers have a PMChildFlags
slot, which is used to detect when a postmaster child exits in an
unclean way (in addition to the exit code). This patch assigns a child
slot for all processes, except for dead-end backends. That includes all
the aux processes.

While we're at it, I created separate pools of child slots for different
kinds of backends, which fixes the issue that opening a lot of client
connections can exhaust all the slots, so that background workers or
autovacuum workers cannot start either [1]/messages/by-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3@iki.fi.

[1]: /messages/by-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3@iki.fi
/messages/by-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3@iki.fi

* v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch

One more little refactoring, to pass MyPMChildSlot to the child process
differently.

Where is all this leading? I'm not sure exactly, but having a postmaster
child slot for every postmaster child seems highly useful. We could move
the ProcSignal machinery to use those slot numbers for the indexes to
the ProcSignal array, instead of ProcSignal, for example. That would
allow all processes to participate in the signalling, even before they
have a PGPROC entry. (Or with Thomas's interrupts refactoring, the
interrupts array). With the multithreading work, PMChild struct could
store a thread id, or whatever is needed for threads to communicate with
each other. In any case, seems like it will come handy.

--
Heikki Linnakangas
Neon (https://neon.tech)

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#3)

Re: Refactoring postmaster's code to cleanup after child exit

On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

* v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

[v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList, to see if it the child process was a
background worker. If not found, then it scans through BackendList to
see if it was a regular backend. That leads to some duplication
between the bgworker and regular backend cleanup code, as both have an
entry in the BackendList that needs to be cleaned up in the same way.
Refactor that so that we scan just the BackendList to find the child
process, and if it was a background worker, do the additional
bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
now logged with that exit code, instead of 0. Also, if a bgworker
exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
restarted. Previously it was treated as a normal exit.

Interesting. So when that error was first specially handled in this thread:

/messages/by-id/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i+M@mail.gmail.com

... it went from being considered a crash, to being considered like
exit(0). It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times). Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

If a child process is not found in the BackendList, the log message
now calls it "untracked child process" rather than "server process".
Arguably that should be a PANIC, because we do track all the child
processes in the list, so failing to find a child process is highly
unexpected. But if we want to change that, let's discuss and do that
as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some
random other process (or we screwed up the bookkeeping and didn't
recognise it). So at least having a different message seems good.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Much of the code in process_pm_child_exit() to launch replacement
processes when one exits or when progressing to next postmaster state
was unnecessary, because the ServerLoop will launch any missing
background processes anyway. Remove the redundant code and let
ServerLoop handle it.

+1, makes sense.

In ServerLoop, move the code to launch all the processes to a new
subroutine, to group it all together.

+1, makes sense.

More soon...

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 2 years ago

In reply to: Thomas Munro (#4)

Re: Refactoring postmaster's code to cleanup after child exit

On 08/08/2024 13:47, Thomas Munro wrote:

On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
now logged with that exit code, instead of 0. Also, if a bgworker
exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
restarted. Previously it was treated as a normal exit.

Interesting. So when that error was first specially handled in this thread:

/messages/by-id/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i+M@mail.gmail.com

... it went from being considered a crash, to being considered like
exit(0). It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times). Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

I adjusted this ERROR_WAIT_NO_CHILDREN a little more, to avoid logging
the death of the child twice in some cases.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

+1

Committed the patches up to and including this one, with tiny comment
changes.

* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Much of the code in process_pm_child_exit() to launch replacement
processes when one exits or when progressing to next postmaster state
was unnecessary, because the ServerLoop will launch any missing
background processes anyway. Remove the redundant code and let
ServerLoop handle it.

I'm going to work a little more on the comments on this one before
committing; I had just moved all the "If we have lost the XXX, try to
start a new one" comments as is, but they look pretty repetitive now.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 2 years ago

In reply to: Heikki Linnakangas (#5)

Re: Refactoring postmaster's code to cleanup after child exit

On 10/08/2024 00:13, Heikki Linnakangas wrote:

On 08/08/2024 13:47, Thomas Munro wrote:

* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

     Much of the code in process_pm_child_exit() to launch replacement
     processes when one exits or when progressing to next postmaster
state
     was unnecessary, because the ServerLoop will launch any missing
     background processes anyway. Remove the redundant code and let
     ServerLoop handle it.

I'm going to work a little more on the comments on this one before
committing; I had just moved all the "If we have lost the XXX, try to
start a new one" comments as is, but they look pretty repetitive now.

Pushed this now, after adjusting the comments a bit. Thanks again for
the review!

Here are the remaining patches, rebased.

commit a1c43d65907d20a999b203e465db1277ec842a0a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu Aug 1 17:24:12 2024 +0300

Introduce a separate BackendType for dead-end children

And replace postmaster.c's own "backend type" codes with BackendType

XXX: While working on this, many times I accidentally did something
like "foo |= B_SOMETHING" instead of "foo |= 1 << B_SOMETHING", when
constructing arguments to SignalSomeChildren or CountChildren, and
things broke in very subtle ways taking a long time to debug. The old
constants that were already bitmasks avoided that. Maybe we need some
macro magic or something to make this less error-prone.

While rebasing this today, I spotted another instance of that mistake
mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)"
instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make
that less error-prone:

1. Add a separate typedef for the bitmasks, and macros/functions to work
with it. Something like:

typedef struct {
uint32 mask;
} BackendTypeMask;

static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
static const BackendTypeMask BTMASK_NONE = { 0 };

static inline BackendTypeMask
BTMASK_ADD(BackendTypeMask mask, BackendType t)
{
mask.mask |= 1 << t;
return mask;
}

static inline BackendTypeMask
BTMASK_DEL(BackendTypeMask mask, BackendType t)
{
mask.mask &= ~(1 << t);
return mask;
}

Now the compiler will complain if you try to pass a BackendType for the
mask. We could do this just for BackendType, or we could put this in
src/include/lib/ with a more generic name, like "bitmask_u32".

2. Another idea is to redefine the BackendType values to be separate
bits, like the current BACKEND_TYPE_* values in postmaster.c:

typedef enum BackendType
{
B_INVALID = 0,

/* Backends and other backend-like processes */
B_BACKEND = 1 << 1,
B_DEAD_END_BACKEND = 1 << 2,
B_AUTOVAC_LAUNCHER = 1 << 3,
B_AUTOVAC_WORKER = 1 << 4,

...
} BackendType;

Then you can use | and & on BackendTypes directly. It makes it less
clear which function arguments are a BackendType and which are a
bitmask, however.

Thoughts, other ideas?

--
Heikki Linnakangas
Neon (https://neon.tech)

Alexander Lakhin

exclusion@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#5)

Re: Refactoring postmaster's code to cleanup after child exit

Hello Heikki,

10.08.2024 00:13, Heikki Linnakangas wrote:

Committed the patches up to and including this one, with tiny comment changes.

I've noticed that on the current HEAD server.log contains binary data
(an invalid process name) after a child crash. For example, while playing
with -ftapv, I've got:
SELECT to_date('2024 613566758 1', 'IYYY IW ID');
server closed the connection unexpectedly

grep -a 'was terminated' server.log
2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG: `�!x� (PID 3441407) was terminated by signal 6: Aborted

It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in
CleanupBackend() may stay uninitialized in some code paths).

Best regards,
Alexander

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 2 years ago

In reply to: Alexander Lakhin (#7)

Re: Refactoring postmaster's code to cleanup after child exit

On 18/08/2024 11:00, Alexander Lakhin wrote:

10.08.2024 00:13, Heikki Linnakangas wrote:

Committed the patches up to and including this one, with tiny comment changes.

I've noticed that on the current HEAD server.log contains binary data
(an invalid process name) after a child crash. For example, while playing
with -ftapv, I've got:
SELECT to_date('2024 613566758 1', 'IYYY IW ID');
server closed the connection unexpectedly

grep -a 'was terminated' server.log
2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG: `�!x� (PID 3441407) was terminated by signal 6: Aborted

It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in
CleanupBackend() may stay uninitialized in some code paths).

Fixed, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Heikki Linnakangas (#6)

Re: Refactoring postmaster's code to cleanup after child exit

Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

While rebasing this today, I spotted another instance of that mistake
mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)"
instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that
less error-prone:

1. Add a separate typedef for the bitmasks, and macros/functions to work
with it. Something like:

typedef struct {
uint32 mask;
} BackendTypeMask;

static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
static const BackendTypeMask BTMASK_NONE = { 0 };

static inline BackendTypeMask
BTMASK_ADD(BackendTypeMask mask, BackendType t)
{
mask.mask |= 1 << t;
return mask;
}

static inline BackendTypeMask
BTMASK_DEL(BackendTypeMask mask, BackendType t)
{
mask.mask &= ~(1 << t);
return mask;
}

Now the compiler will complain if you try to pass a BackendType for the
mask. We could do this just for BackendType, or we could put this in
src/include/lib/ with a more generic name, like "bitmask_u32".

I don't like the second suggestion - that just ends up creating a similar
problem in the future because flag values for one thing can be passed to
something else.

+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
+
+See src/test/perl/README for more info about running these tests.

Is it really useful to have such instructions all over the tree?

From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:55:11 +0300
Subject: [PATCH v4 2/8] Add test for dead-end backends

The code path for launching a dead-end backend because we're out of
slots was not covered by any tests, so add one. (Some tests did hit
the case of launching a dead-end backend because the server is still
starting up, though, so the gap in our test coverage wasn't as big as
it sounds.)
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 39 +++++++++++++++++++
.../postmaster/t/001_connection_limits.pl | 17 +++++++-
2 files changed, 55 insertions(+), 1 deletion(-)

Why does this need to use "raw" connections? Can't you just create a bunch of
connections with BackgroundPsql?

From 88287a2db95e584018f1c7fa9e992feb7d179ce8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:58:35 +0300
Subject: [PATCH v4 3/8] Use an shmem_exit callback to remove backend from
PMChildFlags on exit

This seems nicer than having to duplicate the logic between
InitProcess() and ProcKill() for which child processes have a
PMChildFlags slot.

Move the MarkPostmasterChildActive() call earlier in InitProcess(),
out of the section protected by the spinlock.

---
src/backend/storage/ipc/pmsignal.c | 10 ++++++--
src/backend/storage/lmgr/proc.c | 38 ++++++++++--------------------
src/include/storage/pmsignal.h | 1 -
3 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 27844b46a2..cb99e77476 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -24,6 +24,7 @@
#include "miscadmin.h"
#include "postmaster/postmaster.h"
#include "replication/walsender.h"
+#include "storage/ipc.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
#include "utils/memutils.h"
@@ -121,6 +122,8 @@ postmaster_death_handler(SIGNAL_ARGS)

#endif /* USE_POSTMASTER_DEATH_SIGNAL */

+static void MarkPostmasterChildInactive(int code, Datum arg);
+
/*
* PMSignalShmemSize
*		Compute space needed for pmsignal.c's shared memory
@@ -328,6 +331,9 @@ MarkPostmasterChildActive(void)
slot--;
Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED);
PMSignalState->PMChildFlags[slot] = PM_CHILD_ACTIVE;
+
+	/* Arrange to clean up at exit. */
+	on_shmem_exit(MarkPostmasterChildInactive, 0);
}

/*
@@ -352,8 +358,8 @@ MarkPostmasterChildWalSender(void)
* MarkPostmasterChildInactive - mark a postmaster child as done using
* shared memory.  This is called in the child process.
*/
-void
-MarkPostmasterChildInactive(void)
+static void
+MarkPostmasterChildInactive(int code, Datum arg)
{
int			slot = MyPMChildSlot;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ac66da8638..9536469e89 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -308,6 +308,19 @@ InitProcess(void)
if (MyProc != NULL)
elog(ERROR, "you already exist");

+	/*
+	 * Before we start accessing the shared memory in a serious way, mark
+	 * ourselves as an active postmaster child; this is so that the postmaster
+	 * can detect it if we exit without cleaning up.  (XXX autovac launcher
+	 * currently doesn't participate in this; it probably should.)
+	 *
+	 * Slot sync worker also does not participate in it, see comments atop
+	 * 'struct bkend' in postmaster.c.
+	 */
+	if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() &&
+		!AmLogicalSlotSyncWorkerProcess())
+		MarkPostmasterChildActive();

I'd not necessarily expect a call to MarkPostmasterChildActive() to register
an shmem exit hook - but I guess it's unlikely to be moved around in a
problematic way. Perhaps something like RegisterPostmasterChild() or such
would be a bit clearer?

From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:59:04 +0300
Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children

And replace postmaster.c's own "backend type" codes with BackendType

Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available? Why isn't the type a field in
child_process_kind?

That'd not solve the bitmask confusion issue, but it does seem like a better
direction to me?

From 9c832ce33667abc5aef128a17fa9c27daaad872a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:59:27 +0300
Subject: [PATCH v4 5/8] Kill dead-end children when there's nothing else left

Previously, the postmaster would never try to kill dead-end child
processes, even if there were no other processes left. A dead-end
backend will eventually exit, when authentication_timeout expires, but
if a dead-end backend is the only thing that's preventing the server
from shutting down, it seems better to kill it immediately. It's
particularly important, if there was a bug in the early startup code
that prevented a dead-end child from timing out and exiting normally.

I do wonder if we shouldn't instead get rid of dead end children. We now have
an event based loop in postmaster, it'd perform vastly better to juts handle
these connections in postmaster. And we'd get rid of these weird backend
types. But I guess this is a worthwhile improvement on its own...

Includes a test for that case where a dead-end backend previously kept
the server from shutting down.

The test hardcodes timeouts, I think we've largely come to regret that when we
did. Should probably just be a multiplier based on
PostgreSQL::Test::Utils::timeout_default?

+/*
+ * MaxLivePostmasterChildren
+ *
+ * This reports the number postmaster child processes that can be active.  It
+ * includes all children except for dead_end children.  This allows the array
+ * in shared memory (PMChildFlags) to have a fixed maximum size.
+ */
+int
+MaxLivePostmasterChildren(void)
+{
+	int			n = 0;
+
+	/* We know exactly how mamy worker and aux processes can be active */
+	n += autovacuum_max_workers;
+	n += max_worker_processes;
+	n += NUM_AUXILIARY_PROCS;
+
+	/*
+	 * We allow more connections here than we can have backends because some
+	 * might still be authenticating; they might fail auth, or some existing
+	 * backend might exit before the auth cycle is completed.  The exact
+	 * MaxBackends limit is enforced when a new backend tries to join the
+	 * shared-inval backend array.
+	 */
+	n += 2 * (MaxConnections + max_wal_senders);
+
+	return n;
+}

I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.

+/*
+ * Initialize at postmaster startup
+ */
+void
+InitPostmasterChildSlots(void)
+{
+	int			num_pmchild_slots;
+	int			slotno;
+	PMChild    *slots;
+
+	dlist_init(&freeBackendList);
+	dlist_init(&freeAutoVacWorkerList);
+	dlist_init(&freeBgWorkerList);
+	dlist_init(&freeAuxList);
+	dlist_init(&ActiveChildList);
+
+	num_pmchild_slots = MaxLivePostmasterChildren();
+
+	slots = palloc(num_pmchild_slots * sizeof(PMChild));
+
+	slotno = 0;
+	for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeBackendList);
+		slotno++;
+	}
+	for (int i = 0; i < autovacuum_max_workers; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
+		slotno++;
+	}
+	for (int i = 0; i < max_worker_processes; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeBgWorkerList);
+		slotno++;
+	}
+	for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeAuxList);
+		slotno++;
+	}
+	Assert(slotno == num_pmchild_slots);
+}

Along the same vein - could we generalize this into one array of different
slot types and then loop over that to initialize / acquire the slots?

+/* Return the appropriate free-list for the given backend type */
+static dlist_head *
+GetFreeList(BackendType btype)
+{
+	switch (btype)
+	{
+		case B_BACKEND:
+		case B_BG_WORKER:
+		case B_WAL_SENDER:
+		case B_SLOTSYNC_WORKER:
+			return &freeBackendList;

Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?

+			/*
+			 * Auxiliary processes.  There can be only one of each of these
+			 * running at a time.
+			 */
+		case B_AUTOVAC_LAUNCHER:
+		case B_ARCHIVER:
+		case B_BG_WRITER:
+		case B_CHECKPOINTER:
+		case B_STARTUP:
+		case B_WAL_RECEIVER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
+			return &freeAuxList;
+
+			/*
+			 * Logger is not connected to shared memory, and does not have a
+			 * PGPROC entry, but we still allocate a child slot for it.
+			 */

Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?

Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...

+PMChild *
+AssignPostmasterChildSlot(BackendType btype)
+{
+	dlist_head *freelist;
+	PMChild    *pmchild;
+
+	freelist = GetFreeList(btype);
+
+	if (dlist_is_empty(freelist))
+		return NULL;
+
+	pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
+	pmchild->pid = 0;
+	pmchild->bkend_type = btype;
+	pmchild->rw = NULL;
+	pmchild->bgworker_notify = true;
+
+	/*
+	 * pmchild->child_slot for each entry was initialized when the array of
+	 * slots was allocated.
+	 */
+
+	dlist_push_head(&ActiveChildList, &pmchild->elem);
+
+	ReservePostmasterChildSlot(pmchild->child_slot);
+
+	/* FIXME: find a more elegant way to pass this */
+	MyPMChildSlot = pmchild->child_slot;

What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.

+PMChild *
+FindPostmasterChildByPid(int pid)
+{
+	dlist_iter	iter;
+
+	dlist_foreach(iter, &ActiveChildList)
+	{
+		PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+		if (bp->pid == pid)
+			return bp;
+	}
+	return NULL;
+}

It's not new, but it's quite sad that postmaster's process exit handling is
effectively O(Backends^2)...

@@ -1019,7 +980,15 @@ PostmasterMain(int argc, char *argv[])
/*
* If enabled, start up syslogger collection subprocess
*/
-	SysLoggerPID = SysLogger_Start();
+	SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
+	if (!SysLoggerPMChild)
+		elog(ERROR, "no postmaster child slot available for syslogger");
+	SysLoggerPMChild->pid = SysLogger_Start();
+	if (SysLoggerPMChild->pid == 0)
+	{
+		FreePostmasterChildSlot(SysLoggerPMChild);
+		SysLoggerPMChild = NULL;
+	}

Maybe it's a bit obsessive, but this seems long enough to make it worth not
doing inline in the already long PostmasterMain().

/*
* We're ready to rock and roll...
*/
-	StartupPID = StartChildProcess(B_STARTUP);
-	Assert(StartupPID != 0);
+	StartupPMChild = StartChildProcess(B_STARTUP);
+	Assert(StartupPMChild != NULL);

This (not new) assertion is ... odd.

@@ -1779,21 +1748,6 @@ canAcceptConnections(int backend_type)
if (!connsAllowed && backend_type == B_BACKEND)
return CAC_SHUTDOWN; /* shutdown is pending */

- /*
- * Don't start too many children.
- *
- * We allow more connections here than we can have backends because some
- * might still be authenticating; they might fail auth, or some existing
- * backend might exit before the auth cycle is completed. The exact
- * MaxBackends limit is enforced when a new backend tries to join the
- * shared-inval backend array.
- *
- * The limit here must match the sizes of the per-child-process arrays;
- * see comments for MaxLivePostmasterChildren().
- */
- if (CountChildren(BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND)) >= MaxLivePostmasterChildren())
- result = CAC_TOOMANY;
-
return result;
}

It's nice to get rid of this source of O(N^2).

@@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
(errmsg("received SIGHUP, reloading configuration files")));
ProcessConfigFile(PGC_SIGHUP);
SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND));
-		if (StartupPID != 0)
-			signal_child(StartupPID, SIGHUP);
-		if (BgWriterPID != 0)
-			signal_child(BgWriterPID, SIGHUP);
-		if (CheckpointerPID != 0)
-			signal_child(CheckpointerPID, SIGHUP);
-		if (WalWriterPID != 0)
-			signal_child(WalWriterPID, SIGHUP);
-		if (WalReceiverPID != 0)
-			signal_child(WalReceiverPID, SIGHUP);
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGHUP);
-		if (AutoVacPID != 0)
-			signal_child(AutoVacPID, SIGHUP);
-		if (PgArchPID != 0)
-			signal_child(PgArchPID, SIGHUP);
-		if (SysLoggerPID != 0)
-			signal_child(SysLoggerPID, SIGHUP);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGHUP);

/* Reload authentication config files too */
if (!load_hba())

For a moment I wondered why this change was part of this commit - but I guess
we didn't have any of these in an array/list before this change...

@@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
}

/* Was it the system logger?  If so, try to start a new one */
-		if (pid == SysLoggerPID)
+		if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
-			SysLoggerPID = 0;
/* for safety's sake, launch new logger *first* */
-			SysLoggerPID = SysLogger_Start();
+			SysLoggerPMChild->pid = SysLogger_Start();
+			if (SysLoggerPMChild->pid == 0)
+			{
+				FreePostmasterChildSlot(SysLoggerPMChild);
+				SysLoggerPMChild = NULL;
+			}
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),

Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?

Might be worth having a test ensuring that loggers restart OK.

/* Construct a process name for log message */
+
+	/*
+	 * FIXME: use GetBackendTypeDesc here? How does the localization of that
+	 * work?
+	 */
if (bp->bkend_type == B_DEAD_END_BACKEND)
{
procname = _("dead end backend");

Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?

@@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
{
dlist_iter iter;

-		dlist_foreach(iter, &BackendList)
+		dlist_foreach(iter, &ActiveChildList)
{
-			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+			/* We do NOT restart the syslogger */
+			if (bp == SysLoggerPMChild)
+				continue;

That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no? I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...

@@ -2708,48 +2661,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
* We could exclude dead_end children here, but at least when
* sending SIGABRT it seems better to include them.
*/
-			sigquit_child(bp->pid);
+			sigquit_child(bp);
}
-
-		if (StartupPID != 0)
-		{
-			sigquit_child(StartupPID);
-			StartupStatus = STARTUP_SIGNALED;
-		}
-
-		/* Take care of the bgwriter too */
-		if (BgWriterPID != 0)
-			sigquit_child(BgWriterPID);
-
-		/* Take care of the checkpointer too */
-		if (CheckpointerPID != 0)
-			sigquit_child(CheckpointerPID);
-
-		/* Take care of the walwriter too */
-		if (WalWriterPID != 0)
-			sigquit_child(WalWriterPID);
-
-		/* Take care of the walreceiver too */
-		if (WalReceiverPID != 0)
-			sigquit_child(WalReceiverPID);
-
-		/* Take care of the walsummarizer too */
-		if (WalSummarizerPID != 0)
-			sigquit_child(WalSummarizerPID);
-
-		/* Take care of the autovacuum launcher too */
-		if (AutoVacPID != 0)
-			sigquit_child(AutoVacPID);
-
-		/* Take care of the archiver too */
-		if (PgArchPID != 0)
-			sigquit_child(PgArchPID);
-
-		/* Take care of the slot sync worker too */
-		if (SlotSyncWorkerPID != 0)
-			sigquit_child(SlotSyncWorkerPID);
-
-		/* We do NOT restart the syslogger */
}

Yay.

@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
<snip>
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGTERM);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGTERM);
+		targetMask |= (1 << B_STARTUP);
+		targetMask |= (1 << B_WAL_RECEIVER);
+
+		targetMask |= (1 << B_WAL_SUMMARIZER);
+		targetMask |= (1 << B_SLOTSYNC_WORKER);
/* checkpointer, archiver, stats, and syslogger may continue for now */

+		SignalSomeChildren(SIGTERM, targetMask);
+
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
pmState = PM_WAIT_BACKENDS;
<snip>

It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.

@@ -3130,8 +3047,21 @@ static void
LaunchMissingBackgroundProcesses(void)
{
/* Syslogger is active in all states */
-	if (SysLoggerPID == 0 && Logging_collector)
-		SysLoggerPID = SysLogger_Start();
+	if (SysLoggerPMChild == NULL && Logging_collector)
+	{
+		SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
+		if (!SysLoggerPMChild)
+			elog(LOG, "no postmaster child slot available for syslogger");

How could this elog() be reached? Seems something seriously would have gone
wrong to get here - in which case a LOG that might not even be visible (due to
logger not working) doesn't seem like the right response.

@@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
static void
TerminateChildren(int signal)
{

The comment for TerminateChildren() says "except syslogger and dead_end
backends." - aren't you including the latter here?

@@ -311,14 +311,9 @@ InitProcess(void)
/*
* Before we start accessing the shared memory in a serious way, mark
* ourselves as an active postmaster child; this is so that the postmaster
-	 * can detect it if we exit without cleaning up.  (XXX autovac launcher
-	 * currently doesn't participate in this; it probably should.)
-	 *
-	 * Slot sync worker also does not participate in it, see comments atop
-	 * 'struct bkend' in postmaster.c.
+	 * can detect it if we exit without cleaning up.
*/
-	if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() &&
-		!AmLogicalSlotSyncWorkerProcess())
+	if (IsUnderPostmaster)
MarkPostmasterChildActive();

/* Decide which list should supply our PGPROC. */
@@ -536,6 +531,9 @@ InitAuxiliaryProcess(void)
if (MyProc != NULL)
elog(ERROR, "you already exist");

+	if (IsUnderPostmaster)
+		MarkPostmasterChildActive();
+
/*
* We use the ProcStructLock to protect assignment and releasing of
* AuxiliaryProcs entries.

Probably worth, at some point soon, to have an InitProcessCommon() or such.

Greetings,

Andres Freund

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Andres Freund (#9)

Re: Refactoring postmaster's code to cleanup after child exit

On 04/09/2024 17:35, Andres Freund wrote:

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
+
+See src/test/perl/README for more info about running these tests.

Is it really useful to have such instructions all over the tree?

That's debatable but I didn't want to go down that rabbit hole with this
patch.

It's repetitive for sure. But there are small variations in which
PG_TEST_EXTRA options you need, whether "make installcheck" runs against
a running server or still creates a temporary cluster, etc.

I tried to deduplicate those instructions by moving the above
boilerplate to src/test/README, and only noting the variations in the
subdirectory READMEs. I didn't like the result. It's very helpful to
have full copy-pasteable commands with all the right "PG_TEST_EXTRA"
options for each test.

These instructions also don't mention how to run the tests with Meson.
The first time I wanted to run individual tests with Meson, it took me a
while to figure it out.

I'll think a little more about how to improve these READMEs, but let's
take that to a separate thread.

From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:55:11 +0300
Subject: [PATCH v4 2/8] Add test for dead-end backends

The code path for launching a dead-end backend because we're out of
slots was not covered by any tests, so add one. (Some tests did hit
the case of launching a dead-end backend because the server is still
starting up, though, so the gap in our test coverage wasn't as big as
it sounds.)
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 39 +++++++++++++++++++
.../postmaster/t/001_connection_limits.pl | 17 +++++++-
2 files changed, 55 insertions(+), 1 deletion(-)

Why does this need to use "raw" connections? Can't you just create a bunch of
connections with BackgroundPsql?

No, these need to be connections that haven't sent the startup packet
the yet.

With Andrew's PqFFI work [1]/messages/by-id/97d1d1b9-d147-f69d-1991-d8794efed41c@dunslane.net, we could do better. The latest version on
that thread doesn't expose the async functions like PQconnectStart()
PQconnectPoll() though, but they can be added.

[1]: /messages/by-id/97d1d1b9-d147-f69d-1991-d8794efed41c@dunslane.net
/messages/by-id/97d1d1b9-d147-f69d-1991-d8794efed41c@dunslane.net

Unless you have comments on these first two patches which just add
tests, I'll commit them shortly. Still processing the rest of your
comments...

--
Heikki Linnakangas
Neon (https://neon.tech)

#11

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Andres Freund (#9)

Re: Refactoring postmaster's code to cleanup after child exit

On 04/09/2024 17:35, Andres Freund wrote:

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:59:04 +0300
Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children

And replace postmaster.c's own "backend type" codes with BackendType

Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available? Why isn't the type a field in
child_process_kind?

Sorry, I didn't understand this. What exactly would you add to
child_process_kind? Where would you use it?

+/*
+ * MaxLivePostmasterChildren
+ *
+ * This reports the number postmaster child processes that can be active.  It
+ * includes all children except for dead_end children.  This allows the array
+ * in shared memory (PMChildFlags) to have a fixed maximum size.
+ */
+int
+MaxLivePostmasterChildren(void)
+{
+	int			n = 0;
+
+	/* We know exactly how mamy worker and aux processes can be active */
+	n += autovacuum_max_workers;
+	n += max_worker_processes;
+	n += NUM_AUXILIARY_PROCS;
+
+	/*
+	 * We allow more connections here than we can have backends because some
+	 * might still be authenticating; they might fail auth, or some existing
+	 * backend might exit before the auth cycle is completed.  The exact
+	 * MaxBackends limit is enforced when a new backend tries to join the
+	 * shared-inval backend array.
+	 */
+	n += 2 * (MaxConnections + max_wal_senders);
+
+	return n;
+}

I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.

Hmm, you mean adding "max this kind of children" field to
child_process_kinds? Perhaps.

+/*
+ * Initialize at postmaster startup
+ */
+void
+InitPostmasterChildSlots(void)
+{
+	int			num_pmchild_slots;
+	int			slotno;
+	PMChild    *slots;
+
+	dlist_init(&freeBackendList);
+	dlist_init(&freeAutoVacWorkerList);
+	dlist_init(&freeBgWorkerList);
+	dlist_init(&freeAuxList);
+	dlist_init(&ActiveChildList);
+
+	num_pmchild_slots = MaxLivePostmasterChildren();
+
+	slots = palloc(num_pmchild_slots * sizeof(PMChild));
+
+	slotno = 0;
+	for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeBackendList);
+		slotno++;
+	}
+	for (int i = 0; i < autovacuum_max_workers; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
+		slotno++;
+	}
+	for (int i = 0; i < max_worker_processes; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeBgWorkerList);
+		slotno++;
+	}
+	for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
+	{
+		init_slot(&slots[slotno], slotno, &freeAuxList);
+		slotno++;
+	}
+	Assert(slotno == num_pmchild_slots);
+}

Along the same vein - could we generalize this into one array of different
slot types and then loop over that to initialize / acquire the slots?

Makes sense.

+/* Return the appropriate free-list for the given backend type */
+static dlist_head *
+GetFreeList(BackendType btype)
+{
+	switch (btype)
+	{
+		case B_BACKEND:
+		case B_BG_WORKER:
+		case B_WAL_SENDER:
+		case B_SLOTSYNC_WORKER:
+			return &freeBackendList;
Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?

No reason. No particular reason they should *not* share the same list
either though.

+			/*
+			 * Auxiliary processes.  There can be only one of each of these
+			 * running at a time.
+			 */
+		case B_AUTOVAC_LAUNCHER:
+		case B_ARCHIVER:
+		case B_BG_WRITER:
+		case B_CHECKPOINTER:
+		case B_STARTUP:
+		case B_WAL_RECEIVER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
+			return &freeAuxList;
+
+			/*
+			 * Logger is not connected to shared memory, and does not have a
+			 * PGPROC entry, but we still allocate a child slot for it.
+			 */
Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?

Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...

Yeah, we can make these fixed.Currently, the # of slots reserved for aux
processes is sized by NUM_AUXILIARY_PROCS, which is one smaller than the
number of different aux proces kinds:

/*
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
* Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
* run during normal operation. Startup process and WAL receiver also consume
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
#define NUM_AUXILIARY_PROCS 6

For PMChildSlot numbers, we could certainly just allocate one more slot.

It would probably make sense for PGPROCs too, even though PGPROC is a
much larger struct.

+PMChild *
+AssignPostmasterChildSlot(BackendType btype)
+{
+	dlist_head *freelist;
+	PMChild    *pmchild;
+
+	freelist = GetFreeList(btype);
+
+	if (dlist_is_empty(freelist))
+		return NULL;
+
+	pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
+	pmchild->pid = 0;
+	pmchild->bkend_type = btype;
+	pmchild->rw = NULL;
+	pmchild->bgworker_notify = true;
+
+	/*
+	 * pmchild->child_slot for each entry was initialized when the array of
+	 * slots was allocated.
+	 */
+
+	dlist_push_head(&ActiveChildList, &pmchild->elem);
+
+	ReservePostmasterChildSlot(pmchild->child_slot);
+
+	/* FIXME: find a more elegant way to pass this */
+	MyPMChildSlot = pmchild->child_slot;

What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.

It's currently possible to have up to 2 * max_connections backends in
the authentication phase. We would have to change that behaviour, or
make the PGPROC array 2x larger.

It might well be worth it, I don't know how sensible the current
behaviour is. But I'd like to punt that to later patch, to keep the
scope of this patch set reasonable. It's pretty straightforward to do
later on top of this if we want to.

+PMChild *
+FindPostmasterChildByPid(int pid)
+{
+	dlist_iter	iter;
+
+	dlist_foreach(iter, &ActiveChildList)
+	{
+		PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+		if (bp->pid == pid)
+			return bp;
+	}
+	return NULL;
+}
It's not new, but it's quite sad that postmaster's process exit handling is
effectively O(Backends^2)...

It would be straightforward to turn ActiveChildList into a hash table.
But I'd like to leave that to a followup patch too.

/*
* We're ready to rock and roll...
*/
-	StartupPID = StartChildProcess(B_STARTUP);
-	Assert(StartupPID != 0);
+	StartupPMChild = StartChildProcess(B_STARTUP);
+	Assert(StartupPMChild != NULL);

This (not new) assertion is ... odd.

Yeah, it's an assertion because StartChildProcess has this:

/*
* fork failure is fatal during startup, but there's no need to choke
* immediately if starting other child types fails.
*/
if (type == B_STARTUP)
ExitPostmaster(1);

@@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
(errmsg("received SIGHUP, reloading configuration files")));
ProcessConfigFile(PGC_SIGHUP);
SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND));
-		if (StartupPID != 0)
-			signal_child(StartupPID, SIGHUP);
-		if (BgWriterPID != 0)
-			signal_child(BgWriterPID, SIGHUP);
-		if (CheckpointerPID != 0)
-			signal_child(CheckpointerPID, SIGHUP);
-		if (WalWriterPID != 0)
-			signal_child(WalWriterPID, SIGHUP);
-		if (WalReceiverPID != 0)
-			signal_child(WalReceiverPID, SIGHUP);
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGHUP);
-		if (AutoVacPID != 0)
-			signal_child(AutoVacPID, SIGHUP);
-		if (PgArchPID != 0)
-			signal_child(PgArchPID, SIGHUP);
-		if (SysLoggerPID != 0)
-			signal_child(SysLoggerPID, SIGHUP);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGHUP);

/* Reload authentication config files too */
if (!load_hba())

For a moment I wondered why this change was part of this commit - but I guess
we didn't have any of these in an array/list before this change...

Correct.

@@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
}
/* Was it the system logger?  If so, try to start a new one */
-		if (pid == SysLoggerPID)
+		if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
-			SysLoggerPID = 0;
/* for safety's sake, launch new logger *first* */
-			SysLoggerPID = SysLogger_Start();
+			SysLoggerPMChild->pid = SysLogger_Start();
+			if (SysLoggerPMChild->pid == 0)
+			{
+				FreePostmasterChildSlot(SysLoggerPMChild);
+				SysLoggerPMChild = NULL;
+			}
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),
Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?

Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

I'm a little scared of changing the existing logic. We don't have a
mechanism for deferring logging, so we would have to invent that, or the
logs would just accumulate in the pipe until syslogger starts up.
There's some code between here and LaunchMissingBackgroundProcesses(),
so might postmaster get blocked between writing to the syslogger pipe,
before having restarted it?

If forking the syslogger process fails, that can happen anyway, though.

Might be worth having a test ensuring that loggers restart OK.

Yeah..

/* Construct a process name for log message */
+
+	/*
+	 * FIXME: use GetBackendTypeDesc here? How does the localization of that
+	 * work?
+	 */
if (bp->bkend_type == B_DEAD_END_BACKEND)
{
procname = _("dead end backend");

Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?

Constructing the string for background workers is a little more complicated:

snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
bp->rw->rw_worker.bgw_type);

We could still do that for background workers and use the transalated
variant of GetBackendTypeDesc() for everything else though.

@@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
{
dlist_iter iter;
-		dlist_foreach(iter, &BackendList)
+		dlist_foreach(iter, &ActiveChildList)
{
-			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+			/* We do NOT restart the syslogger */
+			if (bp == SysLoggerPMChild)
+				continue;
That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no? I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...

No, we really do not restart the syslogger. This code runs when
*another* process has crashed unexpectedly. We kill all other processes,
reinitialize shared memory and restart, but the old syslogger keeps
running through all that.

I'll add a note on that to InitPostmasterChildSlots(), as it's a bit
surprising.

@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
<snip>
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGTERM);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGTERM);
+		targetMask |= (1 << B_STARTUP);
+		targetMask |= (1 << B_WAL_RECEIVER);
+
+		targetMask |= (1 << B_WAL_SUMMARIZER);
+		targetMask |= (1 << B_SLOTSYNC_WORKER);
/* checkpointer, archiver, stats, and syslogger may continue for now */

+		SignalSomeChildren(SIGTERM, targetMask);
+
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
pmState = PM_WAIT_BACKENDS;
<snip>

It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.

Yeah. I've tried to refactor this into a table before, but didn't come
up with anything that I was happy with. I also feel there must be a
better way to organize this, but not sure what exactly. I hope that will
become more apparent after these other changes.

@@ -3130,8 +3047,21 @@ static void
LaunchMissingBackgroundProcesses(void)
{
/* Syslogger is active in all states */
-	if (SysLoggerPID == 0 && Logging_collector)
-		SysLoggerPID = SysLogger_Start();
+	if (SysLoggerPMChild == NULL && Logging_collector)
+	{
+		SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
+		if (!SysLoggerPMChild)
+			elog(LOG, "no postmaster child slot available for syslogger");
How could this elog() be reached? Seems something seriously would have gone
wrong to get here - in which case a LOG that might not even be visible (due to
logger not working) doesn't seem like the right response.

I'll turn it into an assertion or PANIC.

@@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
static void
TerminateChildren(int signal)
{

The comment for TerminateChildren() says "except syslogger and dead_end
backends." - aren't you including the latter here?

The comment is adjusted in
v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Before
that, SignalChildren() does ignore dead-end children.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

#12

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Heikki Linnakangas (#11)

Re: Refactoring postmaster's code to cleanup after child exit

On Fri, Sep 6, 2024 at 9:13 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It's currently possible to have up to 2 * max_connections backends in
the authentication phase. We would have to change that behaviour, or
make the PGPROC array 2x larger.

I know I already said this elsewhere, but in case it got lost in the
shuffle, +1 for changing this, unless somebody can make a compelling
argument why 2 * max_connections isn't WAY too many.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Heikki Linnakangas (#11)

Re: Refactoring postmaster's code to cleanup after child exit

Hi,

On 2024-09-06 16:13:43 +0300, Heikki Linnakangas wrote:

On 04/09/2024 17:35, Andres Freund wrote:

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Aug 2024 10:59:04 +0300
Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children

And replace postmaster.c's own "backend type" codes with BackendType

Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available? Why isn't the type a field in
child_process_kind?

Sorry, I didn't understand this. What exactly would you add to
child_process_kind? Where would you use it?

I'm not entirely sure what I was thinking of. It might be partially triggering
a prior complaint I had about manually assigning things to MyBackendType,
despite actually having all the information already.

One thing that I just noticed is that this patch orphans comment references to
BACKEND_TYPE_AUTOVAC and BACKEND_TYPE_BGWORKER.

Seems a tad odd to have BACKEND_TYPE_ALL after removing everything else from
the BACKEND_TYPE_* "namespace".

To deal with the issue around bitmasks you had mentioned, I think we should at
least have a static inline function to convert B_* values to the bitmask
index.

+/*
+ * MaxLivePostmasterChildren
+ *
+ * This reports the number postmaster child processes that can be active.  It
+ * includes all children except for dead_end children.  This allows the array
+ * in shared memory (PMChildFlags) to have a fixed maximum size.
+ */
+int
+MaxLivePostmasterChildren(void)
+{
+	int			n = 0;
+
+	/* We know exactly how mamy worker and aux processes can be active */
+	n += autovacuum_max_workers;
+	n += max_worker_processes;
+	n += NUM_AUXILIARY_PROCS;
+
+	/*
+	 * We allow more connections here than we can have backends because some
+	 * might still be authenticating; they might fail auth, or some existing
+	 * backend might exit before the auth cycle is completed.  The exact
+	 * MaxBackends limit is enforced when a new backend tries to join the
+	 * shared-inval backend array.
+	 */
+	n += 2 * (MaxConnections + max_wal_senders);
+
+	return n;
+}

I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.

Hmm, you mean adding "max this kind of children" field to
child_process_kinds? Perhaps.

Yep, that's what I meant.

+/* Return the appropriate free-list for the given backend type */
+static dlist_head *
+GetFreeList(BackendType btype)
+{
+	switch (btype)
+	{
+		case B_BACKEND:
+		case B_BG_WORKER:
+		case B_WAL_SENDER:
+		case B_SLOTSYNC_WORKER:
+			return &freeBackendList;
Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?
No reason. No particular reason they should *not* share the same list either
though.

Aren't they controlled by distinct connection limits? Isn't there a danger
that we could use up entries and fail connections due to that, despite not
actually being above the limit?

Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?

Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...

Yeah, we can make these fixed.

Cool.

Currently, the # of slots reserved for aux processes is sized by
NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux
proces kinds:

/*
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
* Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
* run during normal operation. Startup process and WAL receiver also consume
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
#define NUM_AUXILIARY_PROCS 6

For PMChildSlot numbers, we could certainly just allocate one more slot.

It would probably make sense for PGPROCs too, even though PGPROC is a much
larger struct.

I don't think it's worth worrying about that much. PGPROC is large, but not
*that* large. And the robustness win of properly detecting when there's a
problem around starting/stopping aux workers seems to outweigh that to me.

+PMChild *
+AssignPostmasterChildSlot(BackendType btype)
+{
+	dlist_head *freelist;
+	PMChild    *pmchild;
+
+	freelist = GetFreeList(btype);
+
+	if (dlist_is_empty(freelist))
+		return NULL;
+
+	pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
+	pmchild->pid = 0;
+	pmchild->bkend_type = btype;
+	pmchild->rw = NULL;
+	pmchild->bgworker_notify = true;
+
+	/*
+	 * pmchild->child_slot for each entry was initialized when the array of
+	 * slots was allocated.
+	 */
+
+	dlist_push_head(&ActiveChildList, &pmchild->elem);
+
+	ReservePostmasterChildSlot(pmchild->child_slot);
+
+	/* FIXME: find a more elegant way to pass this */
+	MyPMChildSlot = pmchild->child_slot;
What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.
It's currently possible to have up to 2 * max_connections backends in the
authentication phase. We would have to change that behaviour, or make the
PGPROC array 2x larger.

That however, might be too much...

It might well be worth it, I don't know how sensible the current behaviour
is. But I'd like to punt that to later patch, to keep the scope of this
patch set reasonable. It's pretty straightforward to do later on top of this
if we want to.

Makes sense.

I still think that we'd be better off to just return an error to the client in
postmaster, rather than deal with this dead-end children mess. That was
perhaps justified at some point, but now it seems to add way more complexity
than it's worth. And it's absurdly expensive to fork to return an error. Way
more expensive than just having postmaster send an error and close the socket.

@@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
}
/* Was it the system logger?  If so, try to start a new one */
-		if (pid == SysLoggerPID)
+		if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
-			SysLoggerPID = 0;
/* for safety's sake, launch new logger *first* */
-			SysLoggerPID = SysLogger_Start();
+			SysLoggerPMChild->pid = SysLogger_Start();
+			if (SysLoggerPMChild->pid == 0)
+			{
+				FreePostmasterChildSlot(SysLoggerPMChild);
+				SysLoggerPMChild = NULL;
+			}
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),
Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?
Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

Yea - which it already can do, presumably to handle the case of
logging_collector. It just seems odd to have code to have three places calling
SysLogger_Start() - with some mild variations of the code.

Perhaps we can at least centralize some of that?

But you have a point with:

I'm a little scared of changing the existing logic. We don't have a
mechanism for deferring logging, so we would have to invent that, or the
logs would just accumulate in the pipe until syslogger starts up. There's
some code between here and LaunchMissingBackgroundProcesses(), so might
postmaster get blocked between writing to the syslogger pipe, before having
restarted it?

If forking the syslogger process fails, that can happen anyway, though.

/* Construct a process name for log message */
+
+	/*
+	 * FIXME: use GetBackendTypeDesc here? How does the localization of that
+	 * work?
+	 */
if (bp->bkend_type == B_DEAD_END_BACKEND)
{
procname = _("dead end backend");
Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?
Constructing the string for background workers is a little more complicated:

Random aside: I *hate* that there's no trivial way recognie background workers
in pg_stat_activity, because somebody made pg_stat_activity.backend_type
report something completely under control of extensions...

@@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
{
dlist_iter	iter;
-		dlist_foreach(iter, &BackendList)
+		dlist_foreach(iter, &ActiveChildList)
{
-			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+			/* We do NOT restart the syslogger */
+			if (bp == SysLoggerPMChild)
+				continue;
That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no? I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...
No, we really do not restart the syslogger.

Hm?

/* Was it the system logger? If so, try to start a new one */
if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
/* for safety's sake, launch new logger *first* */
SysLoggerPMChild->pid = SysLogger_Start(SysLoggerPMChild->child_slot);
if (SysLoggerPMChild->pid == 0)
{
FreePostmasterChildSlot(SysLoggerPMChild);
SysLoggerPMChild = NULL;
}
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),
pid, exitstatus);
continue;
}

We don't do it reaction to other processes crashing, but we still restart it
if it dies. Perhaps it's clear from context - but I had to think aobut it for
a moment.

@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
<snip>
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGTERM);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGTERM);
+		targetMask |= (1 << B_STARTUP);
+		targetMask |= (1 << B_WAL_RECEIVER);
+
+		targetMask |= (1 << B_WAL_SUMMARIZER);
+		targetMask |= (1 << B_SLOTSYNC_WORKER);
/* checkpointer, archiver, stats, and syslogger may continue for now */
+		SignalSomeChildren(SIGTERM, targetMask);
+
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
pmState = PM_WAIT_BACKENDS;
<snip>
It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.
Yeah. I've tried to refactor this into a table before, but didn't come up
with anything that I was happy with. I also feel there must be a better way
to organize this, but not sure what exactly. I hope that will become more
apparent after these other changes.

What I'm imagining is something like:
1) Make PMState values each have a distinct bit
2) Move PMState to some (new?) header
3) Add a "uint32 should_run" member to child_process_kind that's a bitmask of
PMStates
4) Add a new function in launch_backend.c that gets passed the "target"
PMState and returns a bitmask of the tasks that should be running (or the
inverse, doesn't really matter).
5) Instead of open-coding the targetMask "computation", use the new function
from 4).

I think that might not look too bad?

Greetings,

Andres Freund

#14

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Andres Freund (#13)

Re: Refactoring postmaster's code to cleanup after child exit

On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:

I still think that we'd be better off to just return an error to the client in
postmaster, rather than deal with this dead-end children mess. That was
perhaps justified at some point, but now it seems to add way more complexity
than it's worth. And it's absurdly expensive to fork to return an error. Way
more expensive than just having postmaster send an error and close the socket.

The tricky case is the one where the client write() -- or SSL_write() -- blocks.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Heikki Linnakangas (#6)

Re: Refactoring postmaster's code to cleanup after child exit

Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

@@ -2864,6 +2777,8 @@ PostmasterStateMachine(void)
*/
if (pmState == PM_STOP_BACKENDS)
{
+ uint32 targetMask;
+
/*
* Forget any pending requests for background workers, since we're no
* longer willing to launch any new workers. (If additional requests
@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
*/
ForgetUnstartedBackgroundWorkers();

-		/* Signal all backend children except walsenders and dead-end backends */
-		SignalSomeChildren(SIGTERM,
-						   BACKEND_TYPE_ALL & ~(1 << B_WAL_SENDER | 1 << B_DEAD_END_BACKEND));
+		/* Signal all backend children except walsenders */
+		/* dead-end children are not signalled yet */
+		targetMask = (1 << B_BACKEND);
+		targetMask |= (1 << B_BG_WORKER);
+
/* and the autovac launcher too */
-		if (AutoVacPID != 0)
-			signal_child(AutoVacPID, SIGTERM);
+		targetMask |= (1 << B_AUTOVAC_LAUNCHER);
/* and the bgwriter too */
-		if (BgWriterPID != 0)
-			signal_child(BgWriterPID, SIGTERM);
+		targetMask |= (1 << B_BG_WRITER);
/* and the walwriter too */
-		if (WalWriterPID != 0)
-			signal_child(WalWriterPID, SIGTERM);
+		targetMask |= (1 << B_WAL_WRITER);
/* If we're in recovery, also stop startup and walreceiver procs */
-		if (StartupPID != 0)
-			signal_child(StartupPID, SIGTERM);
-		if (WalReceiverPID != 0)
-			signal_child(WalReceiverPID, SIGTERM);
-		if (WalSummarizerPID != 0)
-			signal_child(WalSummarizerPID, SIGTERM);
-		if (SlotSyncWorkerPID != 0)
-			signal_child(SlotSyncWorkerPID, SIGTERM);
+		targetMask |= (1 << B_STARTUP);
+		targetMask |= (1 << B_WAL_RECEIVER);
+
+		targetMask |= (1 << B_WAL_SUMMARIZER);
+		targetMask |= (1 << B_SLOTSYNC_WORKER);
/* checkpointer, archiver, stats, and syslogger may continue for now */

+		SignalSomeChildren(SIGTERM, targetMask);
+
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
pmState = PM_WAIT_BACKENDS;
}

I think this might now omit shutting down at least autovac workers, which
afaict previously were included.

Greetings,

Andres Freund

#16

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Robert Haas (#14)

Re: Refactoring postmaster's code to cleanup after child exit

Hi,

On 2024-09-10 13:33:36 -0400, Robert Haas wrote:

On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:

I still think that we'd be better off to just return an error to the client in
postmaster, rather than deal with this dead-end children mess. That was
perhaps justified at some point, but now it seems to add way more complexity
than it's worth. And it's absurdly expensive to fork to return an error. Way
more expensive than just having postmaster send an error and close the socket.

The tricky case is the one where the client write() -- or SSL_write() -- blocks.

Yea, SSL definitely does make it harder. But it's not exactly rocket science
to do non-blocking SSL connection establishment. After all, we do manage to
do so in libpq...

Greetings,

Andres Freund

#17

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Heikki Linnakangas (#10)

Re: Refactoring postmaster's code to cleanup after child exit

On 06/09/2024 12:52, Heikki Linnakangas wrote:

Unless you have comments on these first two patches which just add
tests, I'll commit them shortly. Still processing the rest of your
comments...

Didn't happen as "shortly" as I thought..

My test for dead-end backends opens 20 TCP (or unix domain) connections
to the server, in quick succession. That works fine my system, and it
passed cirrus CI on other platforms, but on FreeBSD it failed
repeatedly. The behavior in that scenario is apparently
platform-dependent: it depends on the accept queue size, but what
happens when you reach the queue size also seems to depend on the
platform. On my Linux system, the connect() calls in the client are
blocked, if the server is doesn't call accept() fast enough, but
apparently you get an error on *BSD systems.

I'm not sure of the exact details, but in any case, platform-dependent
behavior needs to be avoided in tests. So I changed the test so that it
sends an SSLRequest packet on each connection and waits for reply (which
is always 'N' to reject it in this test), before opening the next
connection. This way, each connection is still left hanging, which is
what I want in this test, but only after postmaster has successfully
accept()ed it and forked the backend.

So here are these test patches again, with that addition.

--
Heikki Linnakangas
Neon (https://neon.tech)

#18

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Heikki Linnakangas (#17)

Re: Refactoring postmaster's code to cleanup after child exit

On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

My test for dead-end backends opens 20 TCP (or unix domain) connections
to the server, in quick succession. That works fine my system, and it
passed cirrus CI on other platforms, but on FreeBSD it failed
repeatedly. The behavior in that scenario is apparently
platform-dependent: it depends on the accept queue size, but what
happens when you reach the queue size also seems to depend on the
platform. On my Linux system, the connect() calls in the client are
blocked, if the server is doesn't call accept() fast enough, but
apparently you get an error on *BSD systems.

Right, we've analysed that difference in AF_UNIX implementation
before[1]/messages/by-id/CADc_NKg2d+oZY9mg4DdQdoUcGzN2kOYXBu-3--RW_hEe0tUV=g@mail.gmail.com, which shows up in the real world, where client sockets ie
libpq's are usually non-blocking, as EAGAIN on Linux (which is not
valid per POSIX) vs ECONNREFUSED on other OSes. All fail to connect,
but the error message is different.

For blocking AF_UNIX client sockets like in your test, Linux
effectively has an infinite queue made from two layers. The listen
queue (a queue of connecting sockets) does respect the requested
backlog size, but when it's full it has an extra trick: the connect()
call waits (in a queue of threads) for space to become free in the
listen queue, so it's effectively unlimited (but only for blocking
sockets), while FreeBSD and I suspect any other implementation
deriving from or reimplementing the BSD socket code gives you
ECONNREFUSED. macOS behaves just the same as FreeBSD AFAICT, so I
don't know why you didn't see the same thing... I guess it's just
racing against accept() draining the queue.

It's possible that Windows copied the Linux behaviour for AF_UNIX,
given that it probably has something to do with the WSL project for
emulating Linux, but IDK.

[1]: /messages/by-id/CADc_NKg2d+oZY9mg4DdQdoUcGzN2kOYXBu-3--RW_hEe0tUV=g@mail.gmail.com

I'm not sure of the exact details, but in any case, platform-dependent
behavior needs to be avoided in tests. So I changed the test so that it
sends an SSLRequest packet on each connection and waits for reply (which
is always 'N' to reject it in this test), before opening the next
connection. This way, each connection is still left hanging, which is
what I want in this test, but only after postmaster has successfully
accept()ed it and forked the backend.

Makes sense.

#19

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Thomas Munro (#18)

Re: Refactoring postmaster's code to cleanup after child exit

On 05/10/2024 01:03, Thomas Munro wrote:

On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

My test for dead-end backends opens 20 TCP (or unix domain) connections
to the server, in quick succession. That works fine my system, and it
passed cirrus CI on other platforms, but on FreeBSD it failed
repeatedly. The behavior in that scenario is apparently
platform-dependent: it depends on the accept queue size, but what
happens when you reach the queue size also seems to depend on the
platform. On my Linux system, the connect() calls in the client are
blocked, if the server is doesn't call accept() fast enough, but
apparently you get an error on *BSD systems.

Right, we've analysed that difference in AF_UNIX implementation
before[1], which shows up in the real world, where client sockets ie
libpq's are usually non-blocking, as EAGAIN on Linux (which is not
valid per POSIX) vs ECONNREFUSED on other OSes. All fail to connect,
but the error message is different.

Thanks for the pointer!

For blocking AF_UNIX client sockets like in your test, Linux
effectively has an infinite queue made from two layers. The listen
queue (a queue of connecting sockets) does respect the requested
backlog size, but when it's full it has an extra trick: the connect()
call waits (in a queue of threads) for space to become free in the
listen queue, so it's effectively unlimited (but only for blocking
sockets), while FreeBSD and I suspect any other implementation
deriving from or reimplementing the BSD socket code gives you
ECONNREFUSED. macOS behaves just the same as FreeBSD AFAICT, so I
don't know why you didn't see the same thing... I guess it's just
racing against accept() draining the queue.

In fact I misremembered: the failure happened on macOS, *not* FreeBSD.
It could be just luck I didn't see it on FreeBSD though.

It's possible that Windows copied the Linux behaviour for AF_UNIX,
given that it probably has something to do with the WSL project for
emulating Linux, but IDK.

Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or
at least on this perl distribution we're using in Cirrus CI):

Socket::pack_sockaddr_un not implemented on this architecture at
C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872.

so I'll have to disable this test on Windows anyway.

--
Heikki Linnakangas
Neon (https://neon.tech)

#20

Dagfinn Ilmari Mannsåker

ilmari@ilmari.org

over 1 year ago

In reply to: Heikki Linnakangas (#19)

Re: Refactoring postmaster's code to cleanup after child exit

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 05/10/2024 01:03, Thomas Munro wrote:

It's possible that Windows copied the Linux behaviour for AF_UNIX,
given that it probably has something to do with the WSL project for
emulating Linux, but IDK.

Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or
at least on this perl distribution we're using in Cirrus CI):

Socket::pack_sockaddr_un not implemented on this architecture at
C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872.

so I'll have to disable this test on Windows anyway.

Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un()
on Windows, so that can be fixed by bumping the Perl version in
https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl
to something more modern (such as 5.40.0.1), and only skipping the test
if on Windows if Socket is too old.

The decision to use 5.26 seems to come from the initial creation of the
CI images in 2021 (when 5.34 was current), with the comment «newer
versions don't currently work correctly for plperl». That claim is
worth revisiting, and fixing if it's still the case.

- ilmari

#21

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Dagfinn Ilmari Mannsåker (#20)

#22

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Dagfinn Ilmari Mannsåker (#20)

#23

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Heikki Linnakangas (#22)

#24

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Andres Freund (#15)

#25

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Heikki Linnakangas (#24)

#26

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 1 year ago

In reply to: Heikki Linnakangas (#25)

#27

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Tomas Vondra (#26)

#28

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 1 year ago

In reply to: Heikki Linnakangas (#27)

#29

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Tomas Vondra (#28)

#30

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 1 year ago

In reply to: Heikki Linnakangas (#29)

#31

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 1 year ago

In reply to: Heikki Linnakangas (#30)

#32

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#26)

#33

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Heikki Linnakangas (#30)

#34

Michael Paquier

michael@paquier.xyz

about 1 year ago

In reply to: Andres Freund (#33)

#35

Noah Misch

noah@leadboat.com

about 1 year ago

In reply to: Andres Freund (#32)

#36

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Noah Misch (#35)

#37

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 1 year ago

In reply to: Michael Paquier (#34)

#38

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 1 year ago

In reply to: Andres Freund (#36)

#39

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Heikki Linnakangas (#38)

#40

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#39)

#41

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#40)

#42

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#41)

#43

Andres Freund

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#42)

#44

Michael Banck

michael.banck@credativ.de

3 months ago

In reply to: Heikki Linnakangas (#22)

#45

Michael Banck

michael.banck@credativ.de

3 months ago

In reply to: Heikki Linnakangas (#22)

#46

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

3 months ago

In reply to: Thomas Munro (#4)

#47

Michael Banck

michael.banck@credativ.de

3 months ago

In reply to: Heikki Linnakangas (#46)

#48

Michael Banck

michael.banck@credativ.de

3 months ago

In reply to: Heikki Linnakangas (#46)

Refactoring postmaster's code to cleanup after child exit

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: