Reorder shutdown sequence, to flush pgstats later

Started by Andres Freundabout 1 year ago22 messages
#1Andres Freund
andres@anarazel.de
4 attachment(s)

Hi,

In the AIO patchset I just hit a corner case in which IO workers can emit
stats ([1]A buffered write completes and RememberSyncRequest() fails, leading to the IO workers performing the flush itself.). That in turn can trigger an assertion failure, because by the
time the IO workers are shut down, checkpointer already has exited and written
out the stats. In this case the relevant IO has completed, but nothing
triggered a stats flush in the IO workers before checkpointer exited.

I think Bilal has also hit this problem somewhat recently, as part of his work
to track WAL IO in pg_stat_io. We intentionally allow walsenders to be active
after checkpointer writes out the shutdown checkpoint and exits.

Robert suggested, when chatting with him about this problem, that we could use
a global barrier to trigger pending stats to be flushed before checkpointer
exits. While that, I think, would fix "my" issue with pending stats in IO
workers, I don't think it'd fix Bilal's problem.

I think that to properly fix this we need a place to flush stats after
everything has shut down.

One way we could do this would be to introduce a new aux process that's
started at the end of the shutdown sequence. But that seems a bit too big a
hammer, without corresponding benefit.

I instead opted to somewhat rearrange the shutdown sequence:

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.

In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.

This patch is not fully polished, there are a few details I'm not sure about:

- Previously the shutdown checkpoint and process exit were triggered from
HandleCheckpointerInterrupts(). I could have continued doing so in the
patch, but it seemed quite weird to have a wait loop in a function called
HandleCheckpointerInterrupts().

Instead I made the main loop in CheckpointerMain() terminate if checkpointer
was asked to write the shutdown checkpoint or exit

- In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
checkpointer signalled ShutdownXLOG having completed. We didn't really have
a need for that before. process_pm_child_exit() does call
PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
feels slightly off

- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?

- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?

The first two patches are just logging improvements that I found helpful to
write this patch:

0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.

Requires a helper to stringify PMState.

I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.

0002: Make logging of postmaster signalling child processes more consistent

I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().

The next is a small prerequisite:

0003: postmaster: Introduce variadic btmask_all_except()

I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.

And then 0004, the reason for this thread.

Comments? Better ideas?

Greetings,

Andres Freund

[1]: A buffered write completes and RememberSyncRequest() fails, leading to the IO workers performing the flush itself.
IO workers performing the flush itself.

Attachments:

v1-0001-postmaster-Update-pmState-in-new-function.patchtext/x-diff; charset=us-asciiDownload
From 5978a67e17b2750b766d959b4529c09a8d5b3023 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 17:59:43 -0500
Subject: [PATCH v1 1/4] postmaster: Update pmState in new function

This mainly makes logging of state changes easier.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 78 +++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a915dac4f7..9d23c3482ae 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -411,6 +411,7 @@ static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
 						 int pid, int exitstatus);
 static void PostmasterStateMachine(void);
+static void UpdatePMState(PMState newState);
 
 static void ExitPostmaster(int status) pg_attribute_noreturn();
 static int	ServerLoop(void);
@@ -1363,7 +1364,7 @@ PostmasterMain(int argc, char *argv[])
 	StartupPMChild = StartChildProcess(B_STARTUP);
 	Assert(StartupPMChild != NULL);
 	StartupStatus = STARTUP_RUNNING;
-	pmState = PM_STARTUP;
+	UpdatePMState(PM_STARTUP);
 
 	/* Some workers may be scheduled to start now */
 	maybe_start_bgworkers();
@@ -2099,7 +2100,7 @@ process_pm_shutdown_request(void)
 			else if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* There should be no clients, so proceed to stop children */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2133,7 +2134,7 @@ process_pm_shutdown_request(void)
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* Just shut down background processes silently */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 			else if (pmState == PM_RUN ||
 					 pmState == PM_HOT_STANDBY)
@@ -2141,7 +2142,7 @@ process_pm_shutdown_request(void)
 				/* Report that we're about to zap live client sessions */
 				ereport(LOG,
 						(errmsg("aborting any active transactions")));
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2176,7 +2177,7 @@ process_pm_shutdown_request(void)
 			/* (note we don't apply send_abort_for_crash here) */
 			SetQuitSignalReason(PMQUIT_FOR_STOP);
 			TerminateChildren(SIGQUIT);
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 
 			/* set stopwatch for them to die */
 			AbortStartTime = time(NULL);
@@ -2231,7 +2232,7 @@ process_pm_child_exit(void)
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2243,7 +2244,7 @@ process_pm_child_exit(void)
 				StartupStatus = STARTUP_NOT_RUNNING;
 				Shutdown = Max(Shutdown, SmartShutdown);
 				TerminateChildren(SIGTERM);
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2288,7 +2289,7 @@ process_pm_child_exit(void)
 				{
 					StartupStatus = STARTUP_NOT_RUNNING;
 					if (pmState == PM_STARTUP)
-						pmState = PM_WAIT_BACKENDS;
+						UpdatePMState(PM_WAIT_BACKENDS);
 				}
 				else
 					StartupStatus = STARTUP_CRASHED;
@@ -2304,7 +2305,7 @@ process_pm_child_exit(void)
 			FatalError = false;
 			AbortStartTime = 0;
 			ReachedNormalRunning = true;
-			pmState = PM_RUN;
+			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
 			/*
@@ -2377,7 +2378,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				pmState = PM_SHUTDOWN_2;
+				UpdatePMState(PM_SHUTDOWN_2);
 			}
 			else
 			{
@@ -2729,7 +2730,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
 		pmState == PM_SHUTDOWN)
-		pmState = PM_WAIT_BACKENDS;
+		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2821,7 +2822,7 @@ PostmasterStateMachine(void)
 			 * Then we're ready to stop other children.
 			 */
 			if (CountChildren(btmask(B_BACKEND)) == 0)
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 		}
 	}
 
@@ -2909,7 +2910,7 @@ PostmasterStateMachine(void)
 
 			SignalChildren(SIGTERM, targetMask);
 
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 		}
 
 		/* Are any of the target processes still running? */
@@ -2920,7 +2921,7 @@ PostmasterStateMachine(void)
 				/*
 				 * Stop any dead-end children and stop creating new ones.
 				 */
-				pmState = PM_WAIT_DEAD_END;
+				UpdatePMState(PM_WAIT_DEAD_END);
 				ConfigurePostmasterWaitSet(false);
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
@@ -2945,7 +2946,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					pmState = PM_SHUTDOWN;
+					UpdatePMState(PM_SHUTDOWN);
 				}
 				else
 				{
@@ -2960,7 +2961,7 @@ PostmasterStateMachine(void)
 					 * for checkpointer fork failure.
 					 */
 					FatalError = true;
-					pmState = PM_WAIT_DEAD_END;
+					UpdatePMState(PM_WAIT_DEAD_END);
 					ConfigurePostmasterWaitSet(false);
 
 					/* Kill the walsenders and archiver too */
@@ -2980,7 +2981,7 @@ PostmasterStateMachine(void)
 		 */
 		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			pmState = PM_WAIT_DEAD_END;
+			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
 			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 		}
@@ -3013,7 +3014,7 @@ PostmasterStateMachine(void)
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
-			pmState = PM_NO_CHILDREN;
+			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
 
@@ -3097,7 +3098,7 @@ PostmasterStateMachine(void)
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
 		StartupStatus = STARTUP_RUNNING;
-		pmState = PM_STARTUP;
+		UpdatePMState(PM_STARTUP);
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;
 
@@ -3106,6 +3107,41 @@ PostmasterStateMachine(void)
 	}
 }
 
+static const char *
+pmstate_name(PMState state)
+{
+#define PM_CASE(state) case state: return #state
+	switch (state)
+	{
+			PM_CASE(PM_INIT);
+			PM_CASE(PM_STARTUP);
+			PM_CASE(PM_RECOVERY);
+			PM_CASE(PM_HOT_STANDBY);
+			PM_CASE(PM_RUN);
+			PM_CASE(PM_STOP_BACKENDS);
+			PM_CASE(PM_WAIT_BACKENDS);
+			PM_CASE(PM_SHUTDOWN);
+			PM_CASE(PM_SHUTDOWN_2);
+			PM_CASE(PM_WAIT_DEAD_END);
+			PM_CASE(PM_NO_CHILDREN);
+	}
+#undef PM_CASE
+
+	pg_unreachable();
+}
+
+/*
+ * Simple wrapper for updating pmState. The main reason to have this wrapper
+ * is that it makes it easy to log all state transitions.
+ */
+static void
+UpdatePMState(PMState newState)
+{
+	elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
+		 pmState, pmstate_name(pmState), newState, pmstate_name(newState));
+	pmState = newState;
+}
+
 /*
  * Launch background processes after state change, or relaunch after an
  * existing process has exited.
@@ -3524,7 +3560,7 @@ process_pm_pmsignal(void)
 #endif
 		}
 
-		pmState = PM_RECOVERY;
+		UpdatePMState(PM_RECOVERY);
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
@@ -3539,7 +3575,7 @@ process_pm_pmsignal(void)
 		sd_notify(0, "READY=1");
 #endif
 
-		pmState = PM_HOT_STANDBY;
+		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
 
 		/* Some workers may be scheduled to start now */
-- 
2.45.2.746.g06e570c0df.dirty

v1-0002-postmaster-Improve-logging-of-signals-sent-by-pos.patchtext/x-diff; charset=us-asciiDownload
From d7ecde678fb61cc8b56d9b10507ee98506c2bd11 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 20:22:36 -0500
Subject: [PATCH v1 2/4] postmaster: Improve logging of signals sent by
 postmaster

Previously many, in some cases important, signals we never logged. In other
cases the signal name was only included numerically.

Also move from direct use of kill() to signal the av launcher to
signal_child(). There doesn't seem to be a reason for directly using kill().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 61 ++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9d23c3482ae..d57fee54e0d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -419,6 +419,7 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
+static void signal_child_ext(PMChild *pmchild, int signal, int elevel);
 static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
@@ -1693,7 +1694,7 @@ ServerLoop(void)
 		{
 			avlauncher_needs_signal = false;
 			if (AutoVacLauncherPMChild != NULL)
-				kill(AutoVacLauncherPMChild->pid, SIGUSR2);
+				signal_child(AutoVacLauncherPMChild, SIGUSR2);
 		}
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
@@ -3255,6 +3256,37 @@ LaunchMissingBackgroundProcesses(void)
 		maybe_start_bgworkers();
 }
 
+/*
+ * Return string representation of signal.
+ *
+ * Because this is only implemented in signals we already rely on in this file
+ * we don't need to deal with unimplemented or same-numeric-value signals (as
+ * we'd e.g. have to for EWOULDBLOCK / EAGAIN).
+ */
+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+	switch (signal)
+	{
+			PM_CASE(SIGABRT);
+			PM_CASE(SIGCHLD);
+			PM_CASE(SIGHUP);
+			PM_CASE(SIGINT);
+			PM_CASE(SIGKILL);
+			PM_CASE(SIGQUIT);
+			PM_CASE(SIGTERM);
+			PM_CASE(SIGUSR1);
+			PM_CASE(SIGUSR2);
+		default:
+			/* all signals sent by postmaster should be listed here */
+			Assert(false);
+			return "(unknown)";
+	}
+#undef PM_CASE
+	pg_unreachable();
+}
+
 /*
  * Send a signal to a postmaster child process
  *
@@ -3272,10 +3304,16 @@ LaunchMissingBackgroundProcesses(void)
  * child twice will not cause any problems.
  */
 static void
-signal_child(PMChild *pmchild, int signal)
+signal_child_ext(PMChild *pmchild, int signal, int elevel)
 {
 	pid_t		pid = pmchild->pid;
 
+	ereport(elevel,
+			(errmsg_internal("sending signal %d/%s to %s process %d",
+							 signal, pm_signame(signal),
+							 GetBackendTypeDesc(pmchild->bkend_type),
+							 (int) pmchild->pid)));
+
 	if (kill(pid, signal) < 0)
 		elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) pid, signal);
 #ifdef HAVE_SETSID
@@ -3295,6 +3333,15 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
+/*
+ * Like signal_child_ext(), but with a default debug level.
+ */
+static void
+signal_child(PMChild *pmchild, int signal)
+{
+	signal_child_ext(pmchild, signal, DEBUG4);
+}
+
 /*
  * Convenience function for killing a child process after a crash of some
  * other child process.  We log the action at a higher level than we would
@@ -3306,11 +3353,8 @@ signal_child(PMChild *pmchild, int signal)
 static void
 sigquit_child(PMChild *pmchild)
 {
-	ereport(DEBUG2,
-			(errmsg_internal("sending %s to process %d",
-							 (send_abort_for_crash ? "SIGABRT" : "SIGQUIT"),
-							 (int) pmchild->pid)));
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
+	signal_child_ext(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT),
+					 DEBUG2);
 }
 
 /*
@@ -3341,9 +3385,6 @@ SignalChildren(int signal, BackendTypeMask targetMask)
 		if (!btmask_contains(targetMask, bp->bkend_type))
 			continue;
 
-		ereport(DEBUG4,
-				(errmsg_internal("sending signal %d to %s process %d",
-								 signal, GetBackendTypeDesc(bp->bkend_type), (int) bp->pid)));
 		signal_child(bp, signal);
 		signaled = true;
 	}
-- 
2.45.2.746.g06e570c0df.dirty

v1-0003-postmaster-Introduce-variadic-btmask_all_except.patchtext/x-diff; charset=us-asciiDownload
From 7f78e0a6609f9bfb33430137d083012d7bd15ad7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 19:51:00 -0500
Subject: [PATCH v1 3/4] postmaster: Introduce variadic btmask_all_except()

Upcoming patches would otherwise need btmask_all_except3().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/w3z6w3g4aovivs735nk4pzjhmegntecesm3kktpebchegm5o53@aonnq2kn27xi
Backpatch:
---
 src/backend/postmaster/postmaster.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d57fee54e0d..d65b00f9338 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -164,23 +164,20 @@ btmask_del(BackendTypeMask mask, BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_all_except(BackendType t)
+btmask_all_except_n(int nargs, BackendType *t)
 {
 	BackendTypeMask mask = BTYPE_MASK_ALL;
 
-	mask = btmask_del(mask, t);
+	for (int i = 0; i < nargs; i++)
+		mask = btmask_del(mask, t[i]);
 	return mask;
 }
 
-static inline BackendTypeMask
-btmask_all_except2(BackendType t1, BackendType t2)
-{
-	BackendTypeMask mask = BTYPE_MASK_ALL;
-
-	mask = btmask_del(mask, t1);
-	mask = btmask_del(mask, t2);
-	return mask;
-}
+#define btmask_all_except(...) \
+	btmask_all_except_n( \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
 
 static inline bool
 btmask_contains(BackendTypeMask mask, BackendType t)
@@ -2980,7 +2977,7 @@ PostmasterStateMachine(void)
 		 * left by now anyway; what we're really waiting for is walsenders and
 		 * archiver.
 		 */
-		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-- 
2.45.2.746.g06e570c0df.dirty

v1-0004-WIP-Change-shutdown-sequence-to-terminate-checkpo.patchtext/x-diff; charset=us-asciiDownload
From 5d7f521ea233fae2ac1e0956ada8d65db508c0ee Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 21:06:51 -0500
Subject: [PATCH v1 4/4] WIP: Change shutdown sequence to terminate
 checkpointer last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. That already happens in
checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.  Another need for this originates in the AIO patchset, where IO workers
can emit stats in some edge cases and need to run while the shutdown
checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.

In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 103 +++++++++----
 src/backend/postmaster/postmaster.c           | 139 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 170 insertions(+), 76 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..44e25994ca7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -51,6 +51,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +142,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -161,6 +163,7 @@ static void UpdateSharedMemoryConfig(void);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownXLOG(SIGNAL_ARGS);
 
 
 /*
@@ -192,12 +195,12 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
-	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
-	pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+	pqsignal(SIGUSR2, ReqShutdownXLOG);
 
 	/*
 	 * Reset some signals that are accepted by postmaster but not here
@@ -214,8 +217,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster.c will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -330,7 +336,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint.
 	 */
 	for (;;)
 	{
@@ -349,7 +355,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -522,6 +531,8 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			/* We may have received an interrupt during the checkpoint. */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -560,6 +571,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By seperating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -589,29 +650,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -732,6 +770,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -876,6 +915,14 @@ ReqCheckpointHandler(SIGNAL_ARGS)
 	SetLatch(MyLatch);
 }
 
+/* SIGUSR2: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
 
 /* --------------------------------
  *		communication with backends
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d65b00f9338..2fe8e2b23b1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -324,9 +324,10 @@ typedef enum
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
 	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown
 								 * ckpt */
-	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
+	PM_XLOG_IS_SHUTDOWN,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
+	PM_SHUTDOWN_CHECKPOINTER,	/* waiting for checkpointer to shut down */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
 
@@ -2348,35 +2349,16 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know it wrote a shutdown checkpoint,
+				 * otherwise we'd still be in PM_SHUTDOWN.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_SHUTDOWN state) but we might
-				 * have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point we have no children left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_SHUTDOWN_2);
+				UpdatePMState(PM_NO_CHILDREN);
 			}
 			else
 			{
@@ -2924,9 +2906,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2969,19 +2951,19 @@ PostmasterStateMachine(void)
 		}
 	}
 
-	if (pmState == PM_SHUTDOWN_2)
+	if (pmState == PM_XLOG_IS_SHUTDOWN)
 	{
 		/*
-		 * PM_SHUTDOWN_2 state ends when there's no other children than
-		 * dead-end children left. There shouldn't be any regular backends
-		 * left by now anyway; what we're really waiting for is walsenders and
-		 * archiver.
+		 * PM_XLOG_IS_SHUTDOWN state ends when there's no children other than
+		 * checkpointer and dead-end children left. There shouldn't be any
+		 * regular backends left by now anyway; what we're really waiting for
+		 * is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND, B_CHECKPOINTER)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER, B_CHECKPOINTER));
 		}
 	}
 
@@ -2989,29 +2971,51 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_DEAD_END state ends when all other children are gone except
-		 * for the logger.  During normal shutdown, all that remains are
-		 * dead-end backends, but in FatalError processing we jump straight
-		 * here with more processes remaining.  Note that they have already
-		 * been sent appropriate shutdown signals, either during a normal
-		 * state transition leading up to PM_WAIT_DEAD_END, or during
-		 * FatalError processing.
+		 * for the logger and checkpointer.  During normal shutdown, all that
+		 * remains are dead-end backends and checkpointer, but in FatalError
+		 * processing we jump straight here with more processes remaining.
+		 * Note that they have already been sent appropriate shutdown signals,
+		 * either during a normal state transition leading up to
+		 * PM_WAIT_DEAD_END, or during FatalError processing.
 		 *
 		 * The reason we wait is to protect against a new postmaster starting
 		 * conflicting subprocesses; this isn't an ironclad protection, but it
 		 * at least helps in the shutdown-and-immediately-restart scenario.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_CHECKPOINTER)) == 0)
 		{
 			/* These other guys should be dead already */
 			Assert(StartupPMChild == NULL);
 			Assert(WalReceiverPMChild == NULL);
 			Assert(WalSummarizerPMChild == NULL);
 			Assert(BgWriterPMChild == NULL);
-			Assert(CheckpointerPMChild == NULL);
 			Assert(WalWriterPMChild == NULL);
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
+
+			UpdatePMState(PM_SHUTDOWN_CHECKPOINTER);
+
+			/*
+			 * Now that everyone else is gone, tell checkpointer to shut down
+			 * too. That allows checkpointer to perform some last bits of of
+			 * cleanup without other processes interfering.
+			 */
+			SignalChildren(SIGTERM, btmask(B_CHECKPOINTER));
+		}
+	}
+
+	if (pmState == PM_SHUTDOWN_CHECKPOINTER)
+	{
+		/*
+		 * PM_SHUTDOWN_CHECKPOINTER ends when, drumroll, checkpointer has shut
+		 * down. Note that checkpointer already has completed the shutdown
+		 * checkpoint, we just wait for it to do some last bits of cleanup and
+		 * then exit.
+		 */
+		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		{
+			Assert(CheckpointerPMChild == NULL);
 			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
@@ -3119,8 +3123,9 @@ pmstate_name(PMState state)
 			PM_CASE(PM_STOP_BACKENDS);
 			PM_CASE(PM_WAIT_BACKENDS);
 			PM_CASE(PM_SHUTDOWN);
-			PM_CASE(PM_SHUTDOWN_2);
+			PM_CASE(PM_XLOG_IS_SHUTDOWN);
 			PM_CASE(PM_WAIT_DEAD_END);
+			PM_CASE(PM_SHUTDOWN_CHECKPOINTER);
 			PM_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_CASE
@@ -3559,6 +3564,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3670,9 +3677,46 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (pmState == PM_SHUTDOWN &&
+		CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/*
+		 * Checkpointer completed the shutdown checkpoint.
+		 *
+		 * If we have an archiver subprocess, tell it to do a last archive
+		 * cycle and quit. Likewise, if we have walsender processes, tell them
+		 * to send any remaining WAL and quit.
+		 */
+		Assert(Shutdown > NoShutdown);
+
+		/* Waken archiver for the last time */
+		if (PgArchPMChild != NULL)
+			signal_child(PgArchPMChild, SIGUSR2);
+
+		/*
+		 * Waken walsenders for the last time. No regular backends should be
+		 * around anymore.
+		 */
+		SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+		UpdatePMState(PM_XLOG_IS_SHUTDOWN);
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3680,7 +3724,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3991,8 +4035,9 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_SHUTDOWN_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
-		case PM_SHUTDOWN_2:
+		case PM_XLOG_IS_SHUTDOWN:
 		case PM_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

#2Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#1)
Re: Reorder shutdown sequence, to flush pgstats later

On 08/01/2025 04:10, Andres Freund wrote:

I instead opted to somewhat rearrange the shutdown sequence:

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.

In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.

Sounds good.

This patch is not fully polished, there are a few details I'm not sure about:

- Previously the shutdown checkpoint and process exit were triggered from
HandleCheckpointerInterrupts(). I could have continued doing so in the
patch, but it seemed quite weird to have a wait loop in a function called
HandleCheckpointerInterrupts().

Instead I made the main loop in CheckpointerMain() terminate if checkpointer
was asked to write the shutdown checkpoint or exit

- In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
checkpointer signalled ShutdownXLOG having completed. We didn't really have
a need for that before. process_pm_child_exit() does call
PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
feels slightly off

Looks good to me.

- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?

We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow
that example and name all the shutdown states after what you're waiting for:

PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown
ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
finish */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_NO_CHILDREN, /* all important children have exited */

- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?

What logging do you mean? The "LOG: shutting down" message? No
objections to moving it, although it doesn't bother me either way.

The first two patches are just logging improvements that I found helpful to
write this patch:

0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.

Requires a helper to stringify PMState.

I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.

+1

elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));

I think the state names are enough, no need to print the numerical values.

0002: Make logging of postmaster signalling child processes more consistent

I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().

+1

I don't think the cases where DEBUG2 or DEBUG4 are used currently are
very well thought out. To save a few lines of code, I'd be happy with
merging signal_child_ext() and signal_child() and just always use DEBUG2
or DEBUG4. Or DEBUG3 as a compromise :-).

The next is a small prerequisite:

0003: postmaster: Introduce variadic btmask_all_except()

I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.

Nice. A variadic btmask_add() might be handy too.

And then 0004, the reason for this thread.

Overall, looks good to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:

On 08/01/2025 04:10, Andres Freund wrote:

elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));

I think the state names are enough, no need to print the numerical values.

+1

0002: Make logging of postmaster signalling child processes more consistent

I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().

+1

Random comments:

=== 1

+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+       switch (signal)

What about?
"
#define PM_SIGNAL_CASE(sig) case sig: return #sig
"

to better reflect its context.

=== 2

+               default:
+                       /* all signals sent by postmaster should be listed here */
+                       Assert(false);
+                       return "(unknown)";
+       }
+#undef PM_CASE
+       pg_unreachable();

Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).

I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
well thought out. To save a few lines of code, I'd be happy with merging
signal_child_ext() and signal_child()

+1

The next is a small prerequisite:

0003: postmaster: Introduce variadic btmask_all_except()

I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.

LGTM.

Nice. A variadic btmask_add() might be handy too.

And then 0004, the reason for this thread.

Did not give it a detailled review, but IIUC it now provides this flow:

PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN

and that looks good to me to fix the issue.

A few random comments:

=== 3

+ postmaster.c will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.

s/postmaster.c/PostmasterStateMachine()/?

=== 4

+ * too. That allows checkpointer to perform some last bits of of

Typo "of of"

I'll give 0004 a closer look once it leaves the WIP state but want to share that
it looks like a good way to fix the issue.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#4Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
7 attachment(s)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-08 15:10:03 +0200, Heikki Linnakangas wrote:

On 08/01/2025 04:10, Andres Freund wrote:

I instead opted to somewhat rearrange the shutdown sequence:
- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?

We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow that
example and name all the shutdown states after what you're waiting for:

PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown ckpt
*/
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
finish */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_NO_CHILDREN, /* all important children have exited */

I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.

It's probably best to break it out into a separate commit, see attached.

- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?

What logging do you mean? The "LOG: shutting down" message? No objections to
moving it, although it doesn't bother me either way.

Yea. It's also an oddly unspecific message :).

The first two patches are just logging improvements that I found helpful to
write this patch:

0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.

Requires a helper to stringify PMState.

I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.

+1

elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));

I think the state names are enough, no need to print the numerical values.

I thought the numeric value was mildly helpful as the numerically higher value
makes it slightly easier to understand the "direction" of state changes. But I
also wasn't sure it's worth it. Removed in the attached.

0002: Make logging of postmaster signalling child processes more consistent

I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().

+1

I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
well thought out. To save a few lines of code, I'd be happy with merging
signal_child_ext() and signal_child() and just always use DEBUG2 or DEBUG4.
Or DEBUG3 as a compromise :-).

Heh. DEBUG2 seems a bit much compared to some other stuff, e.g. a config
reload will be logged at DEBUG2, seems a bit superfluous to then also use
DEBUG2 for all individual signals.

I was a bit hesitant to lower the level in sigquit_child(), but thinking about
it, the fact that we're gonna signal children is already obvious from the log.

So I went for your compromise position :)

The next is a small prerequisite:

0003: postmaster: Introduce variadic btmask_all_except()

I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.

Nice. A variadic btmask_add() might be handy too.

It does seem to make the code a bit more readable. See attached. Wasn't quite
sure in how many btmask_add() invocations to break it up...

And then 0004, the reason for this thread.

One problem I found is this comment:
* Note: we deliberately ignore SIGTERM, because during a standard Unix
* system shutdown cycle, init will SIGTERM all processes at once. We
* want to wait for the backends to exit, whereupon the postmaster will
* tell us it's okay to shut down (via SIGUSR2).

Which would cause problems with v1 of the patch, as postmaster would write a
shutdown checkpoint in response to SIGTERM.

Unfortunately we don't have any obvious unused signals available, as SIGINT is
used to request a checkpoint.

To deal with that I added a new patch to the stack, which makes
RequestCheckpoint() use SetLatch() on checkpointer's latch to wake it up,
instead of kill(SIGINT). Not that that matters, but using SetLatch()
directly, instead of doing SetLatch() in the signal handler, is slightly more
efficient, because it avoids a signal interrupt when checkpointer isn't
waiting on a latch.

The relevant comment mentioned "or is in process of restarting" - but that's
afaict bogus, we don't allow checkpointer to restart.

After that, postmaster can use SIGINT to request checkpointer to write the
shutdown checkpoint.

I'm currently to plan the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.

The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.

Greetings,

Andres Freund

Attachments:

v2-0001-postmaster-Update-pmState-in-new-function.patchtext/x-diff; charset=us-asciiDownload
From 51052e513c195aa901fb0d4fbc548ff37627bccb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 17:59:43 -0500
Subject: [PATCH v2 1/7] postmaster: Update pmState in new function

This makes logging of state changes easier - state transitions are now logged
at DEBUG1.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 78 +++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a915dac4f7..de278b0850d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -411,6 +411,7 @@ static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
 						 int pid, int exitstatus);
 static void PostmasterStateMachine(void);
+static void UpdatePMState(PMState newState);
 
 static void ExitPostmaster(int status) pg_attribute_noreturn();
 static int	ServerLoop(void);
@@ -1363,7 +1364,7 @@ PostmasterMain(int argc, char *argv[])
 	StartupPMChild = StartChildProcess(B_STARTUP);
 	Assert(StartupPMChild != NULL);
 	StartupStatus = STARTUP_RUNNING;
-	pmState = PM_STARTUP;
+	UpdatePMState(PM_STARTUP);
 
 	/* Some workers may be scheduled to start now */
 	maybe_start_bgworkers();
@@ -2099,7 +2100,7 @@ process_pm_shutdown_request(void)
 			else if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* There should be no clients, so proceed to stop children */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2133,7 +2134,7 @@ process_pm_shutdown_request(void)
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* Just shut down background processes silently */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 			else if (pmState == PM_RUN ||
 					 pmState == PM_HOT_STANDBY)
@@ -2141,7 +2142,7 @@ process_pm_shutdown_request(void)
 				/* Report that we're about to zap live client sessions */
 				ereport(LOG,
 						(errmsg("aborting any active transactions")));
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2176,7 +2177,7 @@ process_pm_shutdown_request(void)
 			/* (note we don't apply send_abort_for_crash here) */
 			SetQuitSignalReason(PMQUIT_FOR_STOP);
 			TerminateChildren(SIGQUIT);
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 
 			/* set stopwatch for them to die */
 			AbortStartTime = time(NULL);
@@ -2231,7 +2232,7 @@ process_pm_child_exit(void)
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2243,7 +2244,7 @@ process_pm_child_exit(void)
 				StartupStatus = STARTUP_NOT_RUNNING;
 				Shutdown = Max(Shutdown, SmartShutdown);
 				TerminateChildren(SIGTERM);
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2288,7 +2289,7 @@ process_pm_child_exit(void)
 				{
 					StartupStatus = STARTUP_NOT_RUNNING;
 					if (pmState == PM_STARTUP)
-						pmState = PM_WAIT_BACKENDS;
+						UpdatePMState(PM_WAIT_BACKENDS);
 				}
 				else
 					StartupStatus = STARTUP_CRASHED;
@@ -2304,7 +2305,7 @@ process_pm_child_exit(void)
 			FatalError = false;
 			AbortStartTime = 0;
 			ReachedNormalRunning = true;
-			pmState = PM_RUN;
+			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
 			/*
@@ -2377,7 +2378,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				pmState = PM_SHUTDOWN_2;
+				UpdatePMState(PM_SHUTDOWN_2);
 			}
 			else
 			{
@@ -2729,7 +2730,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
 		pmState == PM_SHUTDOWN)
-		pmState = PM_WAIT_BACKENDS;
+		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2821,7 +2822,7 @@ PostmasterStateMachine(void)
 			 * Then we're ready to stop other children.
 			 */
 			if (CountChildren(btmask(B_BACKEND)) == 0)
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 		}
 	}
 
@@ -2909,7 +2910,7 @@ PostmasterStateMachine(void)
 
 			SignalChildren(SIGTERM, targetMask);
 
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 		}
 
 		/* Are any of the target processes still running? */
@@ -2920,7 +2921,7 @@ PostmasterStateMachine(void)
 				/*
 				 * Stop any dead-end children and stop creating new ones.
 				 */
-				pmState = PM_WAIT_DEAD_END;
+				UpdatePMState(PM_WAIT_DEAD_END);
 				ConfigurePostmasterWaitSet(false);
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
@@ -2945,7 +2946,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					pmState = PM_SHUTDOWN;
+					UpdatePMState(PM_SHUTDOWN);
 				}
 				else
 				{
@@ -2960,7 +2961,7 @@ PostmasterStateMachine(void)
 					 * for checkpointer fork failure.
 					 */
 					FatalError = true;
-					pmState = PM_WAIT_DEAD_END;
+					UpdatePMState(PM_WAIT_DEAD_END);
 					ConfigurePostmasterWaitSet(false);
 
 					/* Kill the walsenders and archiver too */
@@ -2980,7 +2981,7 @@ PostmasterStateMachine(void)
 		 */
 		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			pmState = PM_WAIT_DEAD_END;
+			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
 			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 		}
@@ -3013,7 +3014,7 @@ PostmasterStateMachine(void)
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
-			pmState = PM_NO_CHILDREN;
+			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
 
@@ -3097,7 +3098,7 @@ PostmasterStateMachine(void)
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
 		StartupStatus = STARTUP_RUNNING;
-		pmState = PM_STARTUP;
+		UpdatePMState(PM_STARTUP);
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;
 
@@ -3106,6 +3107,41 @@ PostmasterStateMachine(void)
 	}
 }
 
+static const char *
+pmstate_name(PMState state)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+	switch (state)
+	{
+			PM_TOSTR_CASE(PM_INIT);
+			PM_TOSTR_CASE(PM_STARTUP);
+			PM_TOSTR_CASE(PM_RECOVERY);
+			PM_TOSTR_CASE(PM_HOT_STANDBY);
+			PM_TOSTR_CASE(PM_RUN);
+			PM_TOSTR_CASE(PM_STOP_BACKENDS);
+			PM_TOSTR_CASE(PM_WAIT_BACKENDS);
+			PM_TOSTR_CASE(PM_SHUTDOWN);
+			PM_TOSTR_CASE(PM_SHUTDOWN_2);
+			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_NO_CHILDREN);
+	}
+#undef PM_TOSTR_CASE
+
+	pg_unreachable();
+}
+
+/*
+ * Simple wrapper for updating pmState. The main reason to have this wrapper
+ * is that it makes it easy to log all state transitions.
+ */
+static void
+UpdatePMState(PMState newState)
+{
+	elog(DEBUG1, "updating PMState from %s to %s",
+		 pmstate_name(pmState), pmstate_name(newState));
+	pmState = newState;
+}
+
 /*
  * Launch background processes after state change, or relaunch after an
  * existing process has exited.
@@ -3524,7 +3560,7 @@ process_pm_pmsignal(void)
 #endif
 		}
 
-		pmState = PM_RECOVERY;
+		UpdatePMState(PM_RECOVERY);
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
@@ -3539,7 +3575,7 @@ process_pm_pmsignal(void)
 		sd_notify(0, "READY=1");
 #endif
 
-		pmState = PM_HOT_STANDBY;
+		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
 
 		/* Some workers may be scheduled to start now */
-- 
2.45.2.746.g06e570c0df.dirty

v2-0002-postmaster-Improve-logging-of-signals-sent-by-pos.patchtext/x-diff; charset=us-asciiDownload
From 28652f322f7a345a0c06cc46a9d5659afdc7f62f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 20:22:36 -0500
Subject: [PATCH v2 2/7] postmaster: Improve logging of signals sent by
 postmaster

Previously many, in some cases important, signals we never logged. In other
cases the signal name was only included numerically.

As part of this, change the debug log level the signal is logged at to DEBUG3,
previously some where DEBUG2, some DEBUG4.

Also move from direct use of kill() to signal the av launcher to
signal_child(). There doesn't seem to be a reason for directly using kill().

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 55 ++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index de278b0850d..51ec097cd43 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1693,7 +1693,7 @@ ServerLoop(void)
 		{
 			avlauncher_needs_signal = false;
 			if (AutoVacLauncherPMChild != NULL)
-				kill(AutoVacLauncherPMChild->pid, SIGUSR2);
+				signal_child(AutoVacLauncherPMChild, SIGUSR2);
 		}
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
@@ -3255,6 +3255,37 @@ LaunchMissingBackgroundProcesses(void)
 		maybe_start_bgworkers();
 }
 
+/*
+ * Return string representation of signal.
+ *
+ * Because this is only implemented in signals we already rely on in this file
+ * we don't need to deal with unimplemented or same-numeric-value signals (as
+ * we'd e.g. have to for EWOULDBLOCK / EAGAIN).
+ */
+static const char *
+pm_signame(int signal)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+	switch (signal)
+	{
+			PM_TOSTR_CASE(SIGABRT);
+			PM_TOSTR_CASE(SIGCHLD);
+			PM_TOSTR_CASE(SIGHUP);
+			PM_TOSTR_CASE(SIGINT);
+			PM_TOSTR_CASE(SIGKILL);
+			PM_TOSTR_CASE(SIGQUIT);
+			PM_TOSTR_CASE(SIGTERM);
+			PM_TOSTR_CASE(SIGUSR1);
+			PM_TOSTR_CASE(SIGUSR2);
+		default:
+			/* all signals sent by postmaster should be listed here */
+			Assert(false);
+			return "(unknown)";
+	}
+#undef PM_TOSTR_CASE
+	pg_unreachable();
+}
+
 /*
  * Send a signal to a postmaster child process
  *
@@ -3276,6 +3307,12 @@ signal_child(PMChild *pmchild, int signal)
 {
 	pid_t		pid = pmchild->pid;
 
+	ereport(DEBUG3,
+			(errmsg_internal("sending signal %d/%s to %s process %d",
+							 signal, pm_signame(signal),
+							 GetBackendTypeDesc(pmchild->bkend_type),
+							 (int) pmchild->pid)));
+
 	if (kill(pid, signal) < 0)
 		elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) pid, signal);
 #ifdef HAVE_SETSID
@@ -3297,19 +3334,14 @@ signal_child(PMChild *pmchild, int signal)
 
 /*
  * Convenience function for killing a child process after a crash of some
- * other child process.  We log the action at a higher level than we would
- * otherwise do, and we apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file
- * are written on the assumption that it is -- but developers might prefer
- * to use SIGABRT to collect per-child core dumps.
+ * other child process.  We apply send_abort_for_crash to decide which signal
+ * to send.  Normally it's SIGQUIT -- and most other comments in this file are
+ * written on the assumption that it is -- but developers might prefer to use
+ * SIGABRT to collect per-child core dumps.
  */
 static void
 sigquit_child(PMChild *pmchild)
 {
-	ereport(DEBUG2,
-			(errmsg_internal("sending %s to process %d",
-							 (send_abort_for_crash ? "SIGABRT" : "SIGQUIT"),
-							 (int) pmchild->pid)));
 	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
 }
 
@@ -3341,9 +3373,6 @@ SignalChildren(int signal, BackendTypeMask targetMask)
 		if (!btmask_contains(targetMask, bp->bkend_type))
 			continue;
 
-		ereport(DEBUG4,
-				(errmsg_internal("sending signal %d to %s process %d",
-								 signal, GetBackendTypeDesc(bp->bkend_type), (int) bp->pid)));
 		signal_child(bp, signal);
 		signaled = true;
 	}
-- 
2.45.2.746.g06e570c0df.dirty

v2-0003-postmaster-Introduce-variadic-btmask_all_except.patchtext/x-diff; charset=us-asciiDownload
From 0ce39702fc396075ee87f7063e775302fd6fb791 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 19:51:00 -0500
Subject: [PATCH v2 3/7] postmaster: Introduce variadic btmask_all_except()

Upcoming patches would otherwise need btmask_all_except3().

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/w3z6w3g4aovivs735nk4pzjhmegntecesm3kktpebchegm5o53@aonnq2kn27xi
---
 src/backend/postmaster/postmaster.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 51ec097cd43..f9d6c552341 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -164,23 +164,20 @@ btmask_del(BackendTypeMask mask, BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_all_except(BackendType t)
+btmask_all_except_n(int nargs, BackendType *t)
 {
 	BackendTypeMask mask = BTYPE_MASK_ALL;
 
-	mask = btmask_del(mask, t);
+	for (int i = 0; i < nargs; i++)
+		mask = btmask_del(mask, t[i]);
 	return mask;
 }
 
-static inline BackendTypeMask
-btmask_all_except2(BackendType t1, BackendType t2)
-{
-	BackendTypeMask mask = BTYPE_MASK_ALL;
-
-	mask = btmask_del(mask, t1);
-	mask = btmask_del(mask, t2);
-	return mask;
-}
+#define btmask_all_except(...) \
+	btmask_all_except_n( \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
 
 static inline bool
 btmask_contains(BackendTypeMask mask, BackendType t)
@@ -2979,7 +2976,7 @@ PostmasterStateMachine(void)
 		 * left by now anyway; what we're really waiting for is walsenders and
 		 * archiver.
 		 */
-		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-- 
2.45.2.746.g06e570c0df.dirty

v2-0004-postmaster-Make-btmask_add-variadic.patchtext/x-diff; charset=us-asciiDownload
From d463ec7e1dfaae92d9bd4626448688b9269ebb1b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Jan 2025 14:10:49 -0500
Subject: [PATCH v2 4/7] postmaster: Make btmask_add() variadic

Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/d2cd8fd3-396a-4390-8f0b-74be65e72899@iki.fi
---
 src/backend/postmaster/postmaster.c | 54 ++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f9d6c552341..05e93ac167b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,12 +150,19 @@ btmask(BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_add(BackendTypeMask mask, BackendType t)
+btmask_add_n(BackendTypeMask mask, int nargs, BackendType *t)
 {
-	mask.mask |= 1 << t;
+	for (int i = 0; i < nargs; i++)
+		mask.mask |= 1 << t[i];
 	return mask;
 }
 
+#define btmask_add(mask, ...) \
+	btmask_add_n(mask, \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
+
 static inline BackendTypeMask
 btmask_del(BackendTypeMask mask, BackendType t)
 {
@@ -2840,22 +2847,27 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends, no
 		 * autovac launcher or workers, and no bgworkers (including
-		 * unconnected ones).  No walwriter, bgwriter, slot sync worker, or
-		 * WAL summarizer either.
+		 * unconnected ones).
 		 */
-		targetMask = btmask_add(targetMask, B_BACKEND);
-		targetMask = btmask_add(targetMask, B_AUTOVAC_LAUNCHER);
-		targetMask = btmask_add(targetMask, B_AUTOVAC_WORKER);
-		targetMask = btmask_add(targetMask, B_BG_WORKER);
+		targetMask = btmask_add(targetMask,
+								B_BACKEND,
+								B_AUTOVAC_LAUNCHER,
+								B_AUTOVAC_WORKER,
+								B_BG_WORKER);
 
-		targetMask = btmask_add(targetMask, B_WAL_WRITER);
-		targetMask = btmask_add(targetMask, B_BG_WRITER);
-		targetMask = btmask_add(targetMask, B_SLOTSYNC_WORKER);
-		targetMask = btmask_add(targetMask, B_WAL_SUMMARIZER);
+		/*
+		 * No walwriter, bgwriter, slot sync worker, or WAL summarizer either.
+		 */
+		targetMask = btmask_add(targetMask,
+								B_WAL_WRITER,
+								B_BG_WRITER,
+								B_SLOTSYNC_WORKER,
+								B_WAL_SUMMARIZER);
 
 		/* If we're in recovery, also stop startup and walreceiver procs */
-		targetMask = btmask_add(targetMask, B_STARTUP);
-		targetMask = btmask_add(targetMask, B_WAL_RECEIVER);
+		targetMask = btmask_add(targetMask,
+								B_STARTUP,
+								B_WAL_RECEIVER);
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
@@ -2878,17 +2890,19 @@ PostmasterStateMachine(void)
 		{
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
-			remainMask = btmask_add(remainMask, B_WAL_SENDER);
-			remainMask = btmask_add(remainMask, B_ARCHIVER);
-			remainMask = btmask_add(remainMask, B_DEAD_END_BACKEND);
-			remainMask = btmask_add(remainMask, B_LOGGER);
+			remainMask = btmask_add(remainMask,
+									B_WAL_SENDER,
+									B_ARCHIVER,
+									B_DEAD_END_BACKEND,
+									B_LOGGER);
 
 			/* checkpointer may or may not be in targetMask already */
 			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
 
 			/* these are not real postmaster children */
-			remainMask = btmask_add(remainMask, B_INVALID);
-			remainMask = btmask_add(remainMask, B_STANDALONE_BACKEND);
+			remainMask = btmask_add(remainMask,
+									B_INVALID,
+									B_STANDALONE_BACKEND);
 
 			/* All types should be included in targetMask or remainMask */
 			Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
-- 
2.45.2.746.g06e570c0df.dirty

v2-0005-postmaster-Rename-some-shutdown-related-PMState-p.patchtext/x-diff; charset=us-asciiDownload
From 8781ea45bb8e4f25215ebf5a4e31a3476dec6d99 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Jan 2025 14:05:57 -0500
Subject: [PATCH v2 5/7] postmaster: Rename some shutdown related PMState phase
 names

The previous names weren't particularly clear. Future patches will add more
shutdown phases, making it even more important to have understandable shutdown
phases.

Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/d2cd8fd3-396a-4390-8f0b-74be65e72899@iki.fi
---
 src/backend/postmaster/postmaster.c | 48 +++++++++++++++--------------
 src/backend/replication/README      |  4 +--
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 05e93ac167b..58f2619c45f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -316,9 +316,10 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * Notice that this state variable does not distinguish *why* we entered
  * states later than PM_RUN --- Shutdown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY, PM_HOT_STANDBY,
- * or PM_RUN states, nor in PM_SHUTDOWN states (because we don't enter those
- * states when trying to recover from a crash).  It can be true in PM_STARTUP
- * state, because we don't clear it until we've successfully started WAL redo.
+ * or PM_RUN states, nor in PM_WAIT_XLOG_SHUTDOWN states (because we don't
+ * enter those states when trying to recover from a crash).  It can be true in
+ * PM_STARTUP state, because we don't clear it until we've successfully
+ * started WAL redo.
  */
 typedef enum
 {
@@ -329,9 +330,9 @@ typedef enum
 	PM_RUN,						/* normal "database is alive" state */
 	PM_STOP_BACKENDS,			/* need to stop remaining backends */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown
+	PM_WAIT_XLOG_SHUTDOWN,		/* waiting for checkpointer to do shutdown
 								 * ckpt */
-	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
+	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
@@ -2354,7 +2355,7 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -2363,8 +2364,8 @@ process_pm_child_exit(void)
 				 * occur on next postmaster start.)
 				 *
 				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_SHUTDOWN state) but we might
-				 * have dead-end children to wait for.
+				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
+				 * we might have dead-end children to wait for.
 				 *
 				 * If we have an archiver subprocess, tell it to do a last
 				 * archive cycle and quit. Likewise, if we have walsender
@@ -2382,7 +2383,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				UpdatePMState(PM_SHUTDOWN_2);
+				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
 			}
 			else
 			{
@@ -2733,7 +2734,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_HOT_STANDBY ||
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_SHUTDOWN)
+		pmState == PM_WAIT_XLOG_SHUTDOWN)
 		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
@@ -2957,7 +2958,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					UpdatePMState(PM_SHUTDOWN);
+					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
 				{
@@ -2982,13 +2983,13 @@ PostmasterStateMachine(void)
 		}
 	}
 
-	if (pmState == PM_SHUTDOWN_2)
+	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_SHUTDOWN_2 state ends when there's no other children than
-		 * dead-end children left. There shouldn't be any regular backends
-		 * left by now anyway; what we're really waiting for is walsenders and
-		 * archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
+		 * than dead-end children left. There shouldn't be any regular
+		 * backends left by now anyway; what we're really waiting for is
+		 * walsenders and archiver.
 		 */
 		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
@@ -3131,8 +3132,8 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_RUN);
 			PM_TOSTR_CASE(PM_STOP_BACKENDS);
 			PM_TOSTR_CASE(PM_WAIT_BACKENDS);
-			PM_TOSTR_CASE(PM_SHUTDOWN);
-			PM_TOSTR_CASE(PM_SHUTDOWN_2);
+			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
+			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
@@ -3172,9 +3173,10 @@ LaunchMissingBackgroundProcesses(void)
 	 * The checkpointer and the background writer are active from the start,
 	 * until shutdown is initiated.
 	 *
-	 * (If the checkpointer is not running when we enter the PM_SHUTDOWN
-	 * state, it is launched one more time to perform the shutdown checkpoint.
-	 * That's done in PostmasterStateMachine(), not here.)
+	 * (If the checkpointer is not running when we enter the
+	 * PM_WAIT_XLOG_SHUTDOWN state, it is launched one more time to perform
+	 * the shutdown checkpoint.  That's done in PostmasterStateMachine(), not
+	 * here.)
 	 */
 	if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 		pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -3994,8 +3996,8 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
-		case PM_SHUTDOWN_2:
-		case PM_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_XLOG_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
 			break;
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 8fcd78da9aa..f05d2aae5f9 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -45,8 +45,8 @@ shutdown checkpoint and terminating pgarch and other auxiliary processes, but
 that's not desirable for walsenders, because we want the standby servers to
 receive all the WAL, including the shutdown checkpoint, before the primary
 is shut down. Therefore postmaster treats walsenders like the pgarch process,
-and instructs them to terminate at PM_SHUTDOWN_2 phase, after all regular
-backends have died and checkpointer has issued the shutdown checkpoint.
+and instructs them to terminate at the PM_WAIT_XLOG_ARCHIVAL phase, after all
+regular backends have died and checkpointer has issued the shutdown checkpoint.
 
 When postmaster accepts a connection, it immediately forks a new process
 to handle the handshake and authentication, and the process initializes to
-- 
2.45.2.746.g06e570c0df.dirty

v2-0006-checkpointer-Request-checkpoint-via-latch-instead.patchtext/x-diff; charset=us-asciiDownload
From e956c53bea042524f7905e7d457149a559322d5d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Jan 2025 13:32:29 -0500
Subject: [PATCH v2 6/7] checkpointer: Request checkpoint via latch instead of
 signal

The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer already is busy.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..cd76315bfa4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* --------------------------------
- *		signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* --------------------------------
  *		communication with backends
  * --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without getting a signal.
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-- 
2.45.2.746.g06e570c0df.dirty

v2-0007-Change-shutdown-sequence-to-terminate-checkpointe.patchtext/x-diff; charset=us-asciiDownload
From 02bbfc3b1cda477c41687972e20ac662e3483a5e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Jan 2025 14:09:28 -0500
Subject: [PATCH v2 7/7] Change shutdown sequence to terminate checkpointer
 last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. That already happens in
checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.  Another need for this originates in the AIO patchset, where IO workers
can emit stats in some edge cases and need to run while the shutdown
checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_WAIT_CHECKPOINTER PMState.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 114 +++++++++++----
 src/backend/postmaster/postmaster.c           | 136 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 180 insertions(+), 74 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cd76315bfa4..411aa66aa61 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -51,6 +51,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +142,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -159,6 +161,9 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
 
 /*
  * Main entry point for checkpointer process
@@ -188,7 +193,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +216,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -327,7 +335,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -346,7 +354,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +528,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +573,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By seperating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -586,29 +652,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -729,6 +772,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -857,6 +901,20 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
+
 /* --------------------------------
  *		communication with backends
  * --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 58f2619c45f..87ddc7a2da0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -335,6 +335,7 @@ typedef enum
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
 
@@ -2355,35 +2356,16 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know it wrote a shutdown checkpoint,
+				 * otherwise we'd still be in PM_SHUTDOWN.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point we have no children left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_NO_CHILDREN);
 			}
 			else
 			{
@@ -2938,9 +2920,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2954,10 +2936,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -2986,16 +2968,16 @@ PostmasterStateMachine(void)
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND, B_CHECKPOINTER)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER, B_CHECKPOINTER));
 		}
 	}
 
@@ -3003,29 +2985,52 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_DEAD_END state ends when all other children are gone except
-		 * for the logger.  During normal shutdown, all that remains are
-		 * dead-end backends, but in FatalError processing we jump straight
-		 * here with more processes remaining.  Note that they have already
-		 * been sent appropriate shutdown signals, either during a normal
-		 * state transition leading up to PM_WAIT_DEAD_END, or during
-		 * FatalError processing.
+		 * for the logger and checkpointer.  During normal shutdown, all that
+		 * remains are dead-end backends and checkpointer, but in FatalError
+		 * processing we jump straight here with more processes remaining.
+		 * Note that they have already been sent appropriate shutdown signals,
+		 * either during a normal state transition leading up to
+		 * PM_WAIT_DEAD_END, or during FatalError processing.
 		 *
 		 * The reason we wait is to protect against a new postmaster starting
 		 * conflicting subprocesses; this isn't an ironclad protection, but it
 		 * at least helps in the shutdown-and-immediately-restart scenario.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_CHECKPOINTER)) == 0)
 		{
 			/* These other guys should be dead already */
 			Assert(StartupPMChild == NULL);
 			Assert(WalReceiverPMChild == NULL);
 			Assert(WalSummarizerPMChild == NULL);
 			Assert(BgWriterPMChild == NULL);
-			Assert(CheckpointerPMChild == NULL);
 			Assert(WalWriterPMChild == NULL);
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
+
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone else is gone, tell checkpointer to shut down
+			 * too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
+		}
+	}
+
+	if (pmState == PM_WAIT_CHECKPOINTER)
+	{
+		/*
+		 * PM_WAIT_CHECKPOINTER ends when, drumroll, checkpointer has shut
+		 * down. Note that checkpointer already has completed the shutdown
+		 * checkpoint, we just wait for it to do some last bits of cleanup and
+		 * then exit.
+		 */
+		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		{
+			Assert(CheckpointerPMChild == NULL);
 			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
@@ -3135,6 +3140,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3563,6 +3569,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3674,9 +3682,46 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (pmState == PM_WAIT_XLOG_SHUTDOWN &&
+		CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/*
+		 * Checkpointer completed the shutdown checkpoint.
+		 *
+		 * If we have an archiver subprocess, tell it to do a last archive
+		 * cycle and quit. Likewise, if we have walsender processes, tell them
+		 * to send any remaining WAL and quit.
+		 */
+		Assert(Shutdown > NoShutdown);
+
+		/* Waken archiver for the last time */
+		if (PgArchPMChild != NULL)
+			signal_child(PgArchPMChild, SIGUSR2);
+
+		/*
+		 * Waken walsenders for the last time. No regular backends should be
+		 * around anymore.
+		 */
+		SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+		UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3684,7 +3729,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3995,6 +4040,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

#5Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#3)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:

On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:

On 08/01/2025 04:10, Andres Freund wrote:

0002: Make logging of postmaster signalling child processes more consistent

I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().

+1

Random comments:

=== 1

+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+       switch (signal)

What about?
"
#define PM_SIGNAL_CASE(sig) case sig: return #sig
"
to better reflect its context.

I went for PM_TOSTR_CASE - that way the same name can be used in
pmstate_name().

=== 2

+               default:
+                       /* all signals sent by postmaster should be listed here */
+                       Assert(false);
+                       return "(unknown)";
+       }
+#undef PM_CASE
+       pg_unreachable();

Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).

I don't think so - we're not listing all signals here, just ones that
postmaster is currently using. They're also typically not defined in an enum
allowing the compiler to warn about uncovered values. It seemed better to
actually return something in a production build rather than aborting.

Some compilers tend to be pretty annoying about missing return values, that's
why I added the pg_unreachable(). Perhaps I should have done a
return "" /* silence compilers */;

or such.

Nice. A variadic btmask_add() might be handy too.

And then 0004, the reason for this thread.

Did not give it a detailled review, but IIUC it now provides this flow:

PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN

and that looks good to me to fix the issue.

A few random comments:

=== 3

+ postmaster.c will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.

s/postmaster.c/PostmasterStateMachine()/?

I just went for 'postmaster' (without the .c) - I don't think it's really
useful to specifically reference s/postmaster.c/PostmasterStateMachine() in
checkpointer.c. But I could be swayed if you feel strongly.

=== 4

+ * too. That allows checkpointer to perform some last bits of of

Typo "of of"

Fixed.

I'll give 0004 a closer look once it leaves the WIP state but want to share that
it looks like a good way to fix the issue.

Cool.

Thanks for the review,

Andres Freund

#6Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#5)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Wed, Jan 08, 2025 at 02:32:24PM -0500, Andres Freund wrote:

On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:

=== 2

+               default:
+                       /* all signals sent by postmaster should be listed here */
+                       Assert(false);
+                       return "(unknown)";
+       }
+#undef PM_CASE
+       pg_unreachable();

Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).

I don't think so - we're not listing all signals here, just ones that
postmaster is currently using. They're also typically not defined in an enum
allowing the compiler to warn about uncovered values.

Oh right, the parameter is an "int" not an "enum" (and anyway, as you said, we're
not listing all the signals) (did not pay attention to that). So we may need to
"help" some compilers regarding missing return values.

It seemed better to
actually return something in a production build rather than aborting.

Yeah, fully agree.

Perhaps I should have done a
return "" /* silence compilers */;

Using pg_unreachable() is fine but a comment (like the return value you're
suggesting above) could help.

Thanks for the explanation!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#4)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Wed, Jan 08, 2025 at 02:26:15PM -0500, Andres Freund wrote:

I'm currently to plan the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.

Thanks for the patches.

A few comments:

0001 LGTM.

0002,

=== 1

+static const char *
+pm_signame(int signal)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+   switch (signal)

s/state/signal/? (seems better in the pm_signame() context)

0003 and 0004 LGTM.

0005,

=== 2

+ PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to

I don't love PM_WAIT_XLOG_ARCHIVAL, but I can't think of anything better.

PM_WAIT_ARCHIVER_WALSENDERS maybe? (that would follow the pattern of naming
the processes like PM_WAIT_BACKENDS, PM_WAIT_CHECKPOINTER for example).

That said, I'm not 100% convinced it makes it more clear though...

The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.

0006,

=== 3

+ * when it does start, with or without getting a signal.

s/getting a signal/getting a latch set/ or "getting notified"?

=== 4

+  if (checkpointerProc == INVALID_PROC_NUMBER)
   {
      if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
      {
          elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-               "could not signal for checkpoint: checkpointer is not running");
+                "could not notify checkpoint: checkpointer is not running");

Worth renaming MAX_SIGNAL_TRIES with MAX_NOTIFY_TRIES then?

0007,

=== 5

+ pqsignal(SIGINT, ReqShutdownXLOG);

Worth a comment like:

pqsignal(SIGINT, ReqShutdownXLOG); /* trigger shutdown checkpoint */

=== 6

+ * Wait until we're asked to shut down. By seperating the writing of the

Typo: s/seperating/separating/

I don't really anything else in 0006 and 0007 but as you said it's probably
worth a few more eyes as the code change is pretty substantial.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#8Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#4)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

Thanks for working on this!

On Wed, 8 Jan 2025 at 22:26, Andres Freund <andres@anarazel.de> wrote:

I'm currently to plan the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.

The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.

Some minor comments:

=== 0001

LGTM.

=== 0002

+    }
+#undef PM_TOSTR_CASE
+    pg_unreachable();
+}

Maybe a blank line after '#undef PM_TOSTR_CASE' (or no blank line at
the end of the pmstate_name() at 0001)?

+    ereport(DEBUG3,
+            (errmsg_internal("sending signal %d/%s to %s process %d",

I am not sure if that makes sense but with the addition of the backend
type here, I think 'sending signal %d/%s to %s process with the pid of
%d' would be clearer.

=== 0003

LGTM.

=== 0004

LGTM.

=== 0005

I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.

I liked this, I think it is good enough.

-    PM_SHUTDOWN,                /* waiting for checkpointer to do shutdown
+    PM_WAIT_XLOG_SHUTDOWN,        /* waiting for checkpointer to do shutdown
                                  * ckpt */

There are couple of variables and functions which include pm_shutdown
in their names:
pending_pm_shutdown_request
handle_pm_shutdown_request_signal()
process_pm_shutdown_request()

Do you think these need to be updated as well?

=== 0006

Please see below.

=== 0007

-                 * told to shut down.  We expect that it wrote a shutdown
-                 * checkpoint.  (If for some reason it didn't, recovery will
-                 * occur on next postmaster start.)
+                 * told to shut down.  We know it wrote a shutdown checkpoint,
+                 * otherwise we'd still be in PM_SHUTDOWN.

s/PM_SHUTDOWN/PM_WAIT_XLOG_SHUTDOWN/

I have these comments for now. I need to study 0006 and 0007 more.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#9Andres Freund
andres@anarazel.de
In reply to: Nazir Bilal Yavuz (#8)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-09 16:50:45 +0300, Nazir Bilal Yavuz wrote:

On Wed, 8 Jan 2025 at 22:26, Andres Freund <andres@anarazel.de> wrote:
=== 0005

I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.

I liked this, I think it is good enough.

-    PM_SHUTDOWN,                /* waiting for checkpointer to do shutdown
+    PM_WAIT_XLOG_SHUTDOWN,        /* waiting for checkpointer to do shutdown
* ckpt */

There are couple of variables and functions which include pm_shutdown
in their names:
pending_pm_shutdown_request
handle_pm_shutdown_request_signal()
process_pm_shutdown_request()

Do you think these need to be updated as well?

I don't think so - I think those are actually not specifically referencing the
PM_SHUTDOWN symbol. They're referencing shutting down postmaster - which
neither starts nor ends with PM_SHUTDOWN.

Greetings,

Andres Freund

#10Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#7)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

Thanks for the reviews!

On 2025-01-09 12:01:05 +0000, Bertrand Drouvot wrote:

On Wed, Jan 08, 2025 at 02:26:15PM -0500, Andres Freund wrote:
=== 2

+ PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to

I don't love PM_WAIT_XLOG_ARCHIVAL, but I can't think of anything better.

PM_WAIT_ARCHIVER_WALSENDERS maybe? (that would follow the pattern of naming
the processes like PM_WAIT_BACKENDS, PM_WAIT_CHECKPOINTER for example).

That said, I'm not 100% convinced it makes it more clear though...

Yea, I just went with PM_WAIT_XLOG_ARCHIVAL, it's ok enough.

The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.

0006,

=== 3

+ * when it does start, with or without getting a signal.

s/getting a signal/getting a latch set/ or "getting notified"?

I went with "with or without the SetLatch()".

=== 4

+  if (checkpointerProc == INVALID_PROC_NUMBER)
{
if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
{
elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-               "could not signal for checkpoint: checkpointer is not running");
+                "could not notify checkpoint: checkpointer is not running");

Worth renaming MAX_SIGNAL_TRIES with MAX_NOTIFY_TRIES then?

IDK, didn't seem worth it. SetLatch() is a form of signalling and I don't
think "notify" really is better. And it requires changing more lines...

=== 5

+ pqsignal(SIGINT, ReqShutdownXLOG);

Worth a comment like:

pqsignal(SIGINT, ReqShutdownXLOG); /* trigger shutdown checkpoint */

That seems just a restatement of the function name, don't think we gain
anything by that.

Greetings,

Andres Freund

#11Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#7)
2 attachment(s)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

I've pushed 0001-0005.

I think 0006 isn't far from ready.

However, I think 0007 needs a bit more work.

One thing that worries me a bit is that using SIGINT for triggering the
shutdown checkpoint could somehow be unintentionally triggered? Previously
checkpointer would effectively have ignored that, but now it'd bring the whole
cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
I can't really see a legitimate reason for checkpointer to get spurious
SIGINTs.

This made me think about what happens if a spurious SIGINT is sent - and in
the last version the answer wasn't great, namely everything seemed to go on
normal, except that the cluster couldn't be shut down normally. There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly". We have
multiple copies of code to go to FatalError, that doesn't seem great.

Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:

/*
* Stop any dead-end children and stop creating new ones.
*/
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
and
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);

I don't think this actively causes problems today, but it seems rather iffy.

Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?

I also noticed that checkpointer.c intro comment needed some editing.

Attached are the rebased and edited last two patches.

Independent of this patch, we seem to deal with FatalError processing in a
somewhat inconsistently. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.

However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():

if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);

Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS

From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.

b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?

This comment seems to have been added in

commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200

Assign a child slot to every postmaster child process

ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?

We do have one place that directly switches to PM_WAIT_DEAD_END:
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);

but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));

During postmaster's first start checkpointer is launched before the startup
process is even started. During crash restart processing, we do start the
startup process before checkpointer, but start checkpointer in
LaunchMissingBackgroundProcesses().

It looks like it's theoretically possible that recovery completes before ever
reach LaunchMissingBackgroundProcesses() - but it seems like we should address
that by having the same "process startup order" during crash recovery
processing as we have during the first start?

I think this oddity originated in

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2021-08-02 17:32:20 +1200

Run checkpointer and bgwriter in crash recovery.

ISTM that we should change it so that the first and crash recovery scenarios
are more similar. Thomas, what do you think?

Separately, is it a problem that somewhere in process_pm_* we did
ConfigurePostmasterWaitSet(false) but then the event loop in ServerLoop()
continues to process "old" WL_SOCKET_ACCEPT events?

Greetings,

Andres Freund

Attachments:

v3-0001-checkpointer-Request-checkpoint-via-latch-instead.patchtext/x-diff; charset=us-asciiDownload
From 16289910f50ed92aee08d1320c31235e83273999 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v3 1/2] checkpointer: Request checkpoint via latch instead of
 signal

The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer already is busy.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* --------------------------------
- *		signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* --------------------------------
  *		communication with backends
  * --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without the SetLatch().
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-- 
2.45.2.746.g06e570c0df.dirty

v3-0002-Change-shutdown-sequence-to-terminate-checkpointe.patchtext/x-diff; charset=us-asciiDownload
From 38a8cba733683e37db630240e683eef09498a078 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:45:13 -0500
Subject: [PATCH v3 2/2] Change shutdown sequence to terminate checkpointer
 last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.  Another need for this originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled (via
SIGINT) to trigger writing the shutdown checkpoint without terminating
it. Once checkpointer wrote the checkpoint it will wait for a termination
signal (SIGUSR2, as before).

Postmaster now triggers the shutdown checkpoint via SIGINT, where we
previously did so by terminating checkpointer. Checkpointer now is terminated
after all other children have been terminated, tracked using the new
PM_WAIT_CHECKPOINTER PMState.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 125 +++++++++----
 src/backend/postmaster/postmaster.c           | 165 +++++++++++++-----
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 216 insertions(+), 78 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index dd2c8376c6e..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0).  (All backends must be
- * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT.  After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
  *
  * If the checkpointer exits unexpectedly, the postmaster treats that the same
  * as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +145,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -159,6 +164,9 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
 
 /*
  * Main entry point for checkpointer process
@@ -188,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -327,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -346,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By separating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -586,29 +655,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -729,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -857,6 +904,20 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
+
 /* --------------------------------
  *		communication with backends
  * --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..831a2afbd8f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -335,6 +335,7 @@ typedef enum
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
 
@@ -2355,35 +2356,17 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know checkpointer wrote a shutdown
+				 * checkpoint, otherwise we'd still be in
+				 * PM_WAIT_XLOG_SHUTDOWN state.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point we have no children left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_NO_CHILDREN);
 			}
 			else
 			{
@@ -2938,9 +2921,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2954,10 +2937,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -2986,16 +2969,16 @@ PostmasterStateMachine(void)
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND, B_CHECKPOINTER)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER, B_CHECKPOINTER));
 		}
 	}
 
@@ -3003,29 +2986,52 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_DEAD_END state ends when all other children are gone except
-		 * for the logger.  During normal shutdown, all that remains are
-		 * dead-end backends, but in FatalError processing we jump straight
-		 * here with more processes remaining.  Note that they have already
-		 * been sent appropriate shutdown signals, either during a normal
-		 * state transition leading up to PM_WAIT_DEAD_END, or during
-		 * FatalError processing.
+		 * for the logger and checkpointer.  During normal shutdown, all that
+		 * remains are dead-end backends and checkpointer, but in FatalError
+		 * processing we jump straight here with more processes remaining.
+		 * Note that they have already been sent appropriate shutdown signals,
+		 * either during a normal state transition leading up to
+		 * PM_WAIT_DEAD_END, or during FatalError processing.
 		 *
 		 * The reason we wait is to protect against a new postmaster starting
 		 * conflicting subprocesses; this isn't an ironclad protection, but it
 		 * at least helps in the shutdown-and-immediately-restart scenario.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_CHECKPOINTER)) == 0)
 		{
 			/* These other guys should be dead already */
 			Assert(StartupPMChild == NULL);
 			Assert(WalReceiverPMChild == NULL);
 			Assert(WalSummarizerPMChild == NULL);
 			Assert(BgWriterPMChild == NULL);
-			Assert(CheckpointerPMChild == NULL);
 			Assert(WalWriterPMChild == NULL);
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
+
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone else is gone, tell checkpointer to shut down
+			 * too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
+		}
+	}
+
+	if (pmState == PM_WAIT_CHECKPOINTER)
+	{
+		/*
+		 * PM_WAIT_CHECKPOINTER ends when, drumroll, checkpointer has shut
+		 * down. Note that checkpointer already has completed the shutdown
+		 * checkpoint, we just wait for it to do some last bits of cleanup and
+		 * then exit.
+		 */
+		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		{
+			Assert(CheckpointerPMChild == NULL);
 			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
@@ -3135,6 +3141,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3565,6 +3572,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3676,9 +3685,74 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/* Checkpointer completed the shutdown checkpoint */
+		if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+		{
+			/*
+			 * If we have an archiver subprocess, tell it to do a last archive
+			 * cycle and quit. Likewise, if we have walsender processes, tell them
+			 * to send any remaining WAL and quit.
+			 */
+			Assert(Shutdown > NoShutdown);
+
+			/* Waken archiver for the last time */
+			if (PgArchPMChild != NULL)
+				signal_child(PgArchPMChild, SIGUSR2);
+
+			/*
+			 * Waken walsenders for the last time. No regular backends should be
+			 * around anymore.
+			 */
+			SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+			UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+		}
+		else if (!FatalError && Shutdown < ImmediateShutdown)
+		{
+			/*
+			 * Checkpointer only ought to perform the shutdown checkpoint
+			 * during shutdown.  If somehow checkpointer did so in another
+			 * situation, we have no choice but to crash-restart.
+			 *
+			 * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+			 * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+			 * "interrupted" by a crash or an immediate shutdown.
+			 *
+			 * FIXME: This should probably not have a copy fo the code to
+			 * transition into FatalError state.
+			 */
+			ereport(LOG,
+					(errmsg("WAL was shut down unexpectedly")));
+
+			FatalError = true;
+			UpdatePMState(PM_WAIT_DEAD_END);
+			ConfigurePostmasterWaitSet(false);
+
+			/*
+			 * Doesn't seem likely to help to take send_abort_for_crash into
+			 * account here.
+			 */
+			TerminateChildren(SIGQUIT);
+		}
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3686,7 +3760,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3997,6 +4071,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

#12Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#11)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:

However, I think 0007 needs a bit more work.

One thing that worries me a bit is that using SIGINT for triggering the
shutdown checkpoint could somehow be unintentionally triggered? Previously
checkpointer would effectively have ignored that,

Right.

but now it'd bring the whole
cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
I can't really see a legitimate reason for checkpointer to get spurious
SIGINTs.

Yeah, appart from human error on the OS side, I don't see any reasons too.

This made me think about what happens if a spurious SIGINT is sent - and in
the last version the answer wasn't great, namely everything seemed to go on
normal, except that the cluster couldn't be shut down normally.

Yeah, and I think that checkpoints would hang.

There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly".

I wonder if we could first ask the postmaster a confirmation that the SIGINT is
coming from it? (and if not, then simply ignore the SIGINT). I thought about a
shared memory flag but the postmaster has to be keept away from shared memory
operations, so that's not a valid idea. Another idea could be that the checkpointer
sends a new "confirm" request to the postmater first. But then I'm not sure how
the postmaster could reply back (given the signals that are already being used)
and that might overcomplicate things.

So yeah, might be better to be on the safe side of thing and "simply" enter Fatal.

We have
multiple copies of code to go to FatalError, that doesn't seem great.

+  * FIXME: This should probably not have a copy fo the code to
+  * transition into FatalError state.
+  */
+  ereport(LOG,
+          (errmsg("WAL was shut down unexpectedly")));

Indeed, might be worth extracting this into a helper function or such.

Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:

/*
* Stop any dead-end children and stop creating new ones.
*/
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
and
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);

I don't think this actively causes problems today,

For the first case, we'd ask the checkpointer to do work in the PM_WAIT_DEAD_END
case while already SIGQUIT'd (through SignalHandlerForCrashExit()). But that's
not an issue per say as we check for CheckpointerPMChild != NULL when
pmState == PM_WAIT_DEAD_END.

For the second case (where we failed to fork a checkpointer), we'd also
check for CheckpointerPMChild != NULL when pmState == PM_WAIT_DEAD_END.

So in both case we'd ask a non existing checkpointer to do work but we later
check that it exists, so I also don't think that it causes any problem.

but it seems rather iffy.

Yeah, we intend to ask a non existing checkpointer process to do work...

Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?

Yeah, given the above, it looks more clean to change that ordering.

Indeed that seems unlikely to introduce dead-end children stats and even if we do
, we could think back to put PM_WAIT_CHECKPOINTER after PM_WAIT_DEAD_END (or any
oher way to ensure the stats are flushed after they are shut down). So it's
probably better to focus on the current state and then put
PM_WAIT_CHECKPOINTER before PM_WAIT_DEAD_END.

Anyway, if we reach the 2 cases mentioned above we're not going to flush the
stats anyway (CheckpointerPMChild is NULL) so better to reorder then.

Independent of this patch, we seem to deal with FatalError processing in a
somewhat inconsistently. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.

However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():

if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);

Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS

From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.

b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?

This comment seems to have been added in

commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200

Assign a child slot to every postmaster child process

ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?

I agree that does not look correct but I don't think it's coming from a78af0427015449269fb7d9d8c6057cfcb740149.
ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded this already wrong
comment:

"
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
* (ie, no dead_end children remain), and the archiver is gone too.
*
* The reason we wait for those two is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
* shutdown-and-immediately-restart scenario. Note that they have
* already been sent appropriate shutdown signals, either during a
* normal state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
*/
"

while we could also see:

"
if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
"

It seems to me that it has been introduced in e6a442c71b30f62e7b5eee6058afc961b1c7f29b
where the above comment has been added and does not align with what HandleChildCrash()
was already doing (was already setting PM_WAIT_BACKENDS).

But anyway I do agree that this comment look wrong.

We do have one place that directly switches to PM_WAIT_DEAD_END:
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);

but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));

I do agree. I was trying to reach this code path while replying to one of
the above comments (about PM_WAIT_DEAD_END ordering) but had to replace with
another if condition to reach it during testing.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#12)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:

On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:

There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly".

I wonder if we could first ask the postmaster a confirmation that the SIGINT is
coming from it? (and if not, then simply ignore the SIGINT). I thought about a
shared memory flag but the postmaster has to be keept away from shared memory
operations, so that's not a valid idea. Another idea could be that the checkpointer
sends a new "confirm" request to the postmater first. But then I'm not sure how
the postmaster could reply back (given the signals that are already being used)
and that might overcomplicate things.

That sounds more complicated than it's worth it for a hypothetical risk.

Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?

Yeah, given the above, it looks more clean to change that ordering.

I'll post a version that does that in a bit.

Independent of this patch, we seem to deal with FatalError processing in a
somewhat inconsistently. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.

However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():

if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);

Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS

From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.

b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?

This comment seems to have been added in

commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200

Assign a child slot to every postmaster child process

ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?

I agree that does not look correct but I don't think it's coming from a78af0427015449269fb7d9d8c6057cfcb740149.
ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded this already wrong
comment:

"
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
* (ie, no dead_end children remain), and the archiver is gone too.
*
* The reason we wait for those two is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
* shutdown-and-immediately-restart scenario. Note that they have
* already been sent appropriate shutdown signals, either during a
* normal state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
*/
"

Hm. This version doesn't really seem wrong in the same way / to the same
degree.

while we could also see:

"
if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
"

I don't think this make the version of the comment you included above
inaccurate? It's saying that the signal has been sent in the state *leading
up" to PM_WAIT_DEAD_END, which does seem accurate? The code the comment is
about just won't be reached until postmaster transition to PM_WAIT_DEAD_END.

but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));

I do agree. I was trying to reach this code path while replying to one of
the above comments (about PM_WAIT_DEAD_END ordering) but had to replace with
another if condition to reach it during testing.

I was able to reach it unfortunately

1) Repeated fork failures of checkpointer can lead to it

2) When crash-restarting we, don't start checkpointer immediately, but defer
that until LaunchMissingBackgroundProcesses(). While not easy, it's
possible to trigger a shutdown before that happens.

Greetings,

Andres Freund

#14Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#12)
7 attachment(s)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:

We have
multiple copies of code to go to FatalError, that doesn't seem great.

+  * FIXME: This should probably not have a copy fo the code to
+  * transition into FatalError state.
+  */
+  ereport(LOG,
+          (errmsg("WAL was shut down unexpectedly")));

Indeed, might be worth extracting this into a helper function or such.

That is quite a rats nest of issues :(. There are quite a few oddities around
the related code. The attached series contains a few commits trying to unify
the paths transitioning to FatalError = true into the new HandleFatalError().

The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

After the earlier commit that just moves the code I turned the existing if ()
into a switch so that whoever adds new states is told to look at that code by
the compiler.

Thoughts?

Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:

Attached patch is implemented that way.

Greetings,

Andres Freund

Attachments:

v4-0001-checkpointer-Request-checkpoint-via-latch-instead.patchtext/x-diff; charset=us-asciiDownload
From 833ee00ea2a1341b2e20e88c96d8824c09189df2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v4 1/7] checkpointer: Request checkpoint via latch instead of
 signal

The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer already is busy.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* --------------------------------
- *		signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* --------------------------------
  *		communication with backends
  * --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without the SetLatch().
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-- 
2.45.2.746.g06e570c0df.dirty

v4-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patchtext/x-diff; charset=us-asciiDownload
From 67a3fbb10f46be4dfcaf5adeada4c8c30b5e01b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:20:25 -0500
Subject: [PATCH v4 2/7] postmaster: Don't open-code TerminateChildren() in
 HandleChildCrash()

After removing the duplication no user of sigquit_child() remains, therefore
remove it.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 42 +++--------------------------
 1 file changed, 4 insertions(+), 38 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..8153edc446c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -424,7 +424,6 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
@@ -2699,32 +2698,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	/*
 	 * Signal all other child processes to exit.  The crashed process has
 	 * already been removed from ActiveChildList.
+	 *
+	 * We could exclude dead-end children here, but at least when sending
+	 * SIGABRT it seems better to include them.
 	 */
 	if (take_action)
-	{
-		dlist_iter	iter;
-
-		dlist_foreach(iter, &ActiveChildList)
-		{
-			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
-
-			/* We do NOT restart the syslogger */
-			if (bp == SysLoggerPMChild)
-				continue;
-
-			if (bp == StartupPMChild)
-				StartupStatus = STARTUP_SIGNALED;
-
-			/*
-			 * This backend is still alive.  Unless we did so already, tell it
-			 * to commit hara-kiri.
-			 *
-			 * We could exclude dead-end children here, but at least when
-			 * sending SIGABRT it seems better to include them.
-			 */
-			sigquit_child(bp);
-		}
-	}
+		TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
 
 	if (Shutdown != ImmediateShutdown)
 		FatalError = true;
@@ -3347,19 +3326,6 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process.  We apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
 /*
  * Send a signal to the targeted children.
  */
-- 
2.45.2.746.g06e570c0df.dirty

v4-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patchtext/x-diff; charset=us-asciiDownload
From c2d8fb9f2779ff828ee3dfb9050e5866aa4011bb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:30:46 -0500
Subject: [PATCH v4 3/7] postmaster: Don't repeatedly transition to crashing
 state

Previously HandleChildCrash() skipped logging and signalling child exits if
already in an immediate shutdown or FatalError, but still transitioned server
state in response to a crash. That's redundant.

To make it easier to combine different paths for entering FatalError state,
only do so once.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8153edc446c..939b1b2ef82 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2676,8 +2676,6 @@ CleanupBackend(PMChild *bp,
 static void
 HandleChildCrash(int pid, int exitstatus, const char *procname)
 {
-	bool		take_action;
-
 	/*
 	 * We only log messages and send signals if this is the first process
 	 * crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2685,15 +2683,13 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
 
-	if (take_action)
-	{
-		LogChildExit(LOG, procname, pid, exitstatus);
-		ereport(LOG,
-				(errmsg("terminating any other active server processes")));
-		SetQuitSignalReason(PMQUIT_FOR_CRASH);
-	}
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
+	SetQuitSignalReason(PMQUIT_FOR_CRASH);
 
 	/*
 	 * Signal all other child processes to exit.  The crashed process has
@@ -2702,8 +2698,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * We could exclude dead-end children here, but at least when sending
 	 * SIGABRT it seems better to include them.
 	 */
-	if (take_action)
-		TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+	TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
 
 	if (Shutdown != ImmediateShutdown)
 		FatalError = true;
-- 
2.45.2.746.g06e570c0df.dirty

v4-0004-postmaster-Move-code-to-switch-into-FatalError-st.patchtext/x-diff; charset=us-asciiDownload
From 34f4d2732672579e1cba046d9e3a7c344a133dcf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:57:12 -0500
Subject: [PATCH v4 4/7] postmaster: Move code to switch into FatalError state
 into function

There are two places switching to FatalError mode, behaving somewhat
differently. An upcoming commit will introduce a third. That doesn't seem seem
like a good idea.

This commit just moves the FatalError related code from HandleChildCrash()
into its own function, a subsequent commit will evolve the state machine
change to be suitable for other callers.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 70 +++++++++++++++++++----------
 1 file changed, 46 insertions(+), 24 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 939b1b2ef82..13d49eecd22 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2665,40 +2665,29 @@ CleanupBackend(PMChild *bp,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, archiver, slot sync worker, or background worker.
- *
- * The objectives here are to clean up our local state about the child
- * process, and to signal all other remaining children to quickdie.
- *
- * The caller has already released its PMChild slot.
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
  */
 static void
-HandleChildCrash(int pid, int exitstatus, const char *procname)
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 {
-	/*
-	 * We only log messages and send signals if this is the first process
-	 * crash and we're not doing an immediate shutdown; otherwise, we're only
-	 * here to update postmaster's idea of live processes.  If we have already
-	 * signaled children, nonzero exit status is to be expected, so don't
-	 * clutter log.
-	 */
-	if (FatalError || Shutdown == ImmediateShutdown)
-		return;
+	int			sigtosend;
+
+	SetQuitSignalReason(reason);
 
-	LogChildExit(LOG, procname, pid, exitstatus);
-	ereport(LOG,
-			(errmsg("terminating any other active server processes")));
-	SetQuitSignalReason(PMQUIT_FOR_CRASH);
+	if (consider_sigabrt && send_abort_for_crash)
+		sigtosend = SIGABRT;
+	else
+		sigtosend = SIGQUIT;
 
 	/*
-	 * Signal all other child processes to exit.  The crashed process has
-	 * already been removed from ActiveChildList.
+	 * Signal all other child processes to exit.
 	 *
 	 * We could exclude dead-end children here, but at least when sending
 	 * SIGABRT it seems better to include them.
 	 */
-	TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+	TerminateChildren(sigtosend);
 
 	if (Shutdown != ImmediateShutdown)
 		FatalError = true;
@@ -2719,6 +2708,39 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		AbortStartTime = time(NULL);
 }
 
+/*
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter, autovacuum, archiver, slot sync worker, or background worker.
+ *
+ * The objectives here are to clean up our local state about the child
+ * process, and to signal all other remaining children to quickdie.
+ *
+ * The caller has already released its PMChild slot.
+ */
+static void
+HandleChildCrash(int pid, int exitstatus, const char *procname)
+{
+	/*
+	 * We only log messages and send signals if this is the first process
+	 * crash and we're not doing an immediate shutdown; otherwise, we're only
+	 * here to update postmaster's idea of live processes.  If we have already
+	 * signaled children, nonzero exit status is to be expected, so don't
+	 * clutter log.
+	 */
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
+
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
+
+	/*
+	 * Switch into error state. The crashed process has already been removed
+	 * from ActiveChildList.
+	 */
+	HandleFatalError(PMQUIT_FOR_CRASH, true);
+}
+
 /*
  * Log the death of a child process.
  */
-- 
2.45.2.746.g06e570c0df.dirty

v4-0005-WIP-postmaster-Commonalize-FatalError-paths.patchtext/x-diff; charset=us-asciiDownload
From 4df0e91aa56886db63e2a9552120b2c92def1982 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:25:01 -0500
Subject: [PATCH v4 5/7] WIP: postmaster: Commonalize FatalError paths

This includes some behavioural changes:

- Previously PM_WAIT_XLOG_ARCHIVAL wasn't handled in HandleFatalError(), that
  doesn't seem quite right.
- Failure to fork checkpointer now transitions through PM_WAIT_BACKENDS, like
  child crashes. That's not necessarily great, but...

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 60 +++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 13d49eecd22..6b3f047bd33 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2693,12 +2693,46 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 		FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
-	if (pmState == PM_RECOVERY ||
-		pmState == PM_HOT_STANDBY ||
-		pmState == PM_RUN ||
-		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_WAIT_XLOG_SHUTDOWN)
-		UpdatePMState(PM_WAIT_BACKENDS);
+	switch (pmState)
+	{
+		case PM_INIT:
+			/* shouldn't have any children */
+			Assert(false);
+			break;
+		case PM_STARTUP:
+			/* should have been handled in process_pm_child_exit */
+			Assert(false);
+			break;
+
+			/* wait for children to die */
+		case PM_RECOVERY:
+		case PM_HOT_STANDBY:
+		case PM_RUN:
+		case PM_STOP_BACKENDS:
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_BACKENDS:
+			/* there might be more backends to wait for */
+			break;
+
+		case PM_WAIT_XLOG_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+			/*
+			 * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
+			 * the PM_WAIT_BACKENDS && FatalError code in
+			 * PostmasterStateMachine does not have to be duplicated.
+			 *
+			 * XXX: This seems rather ugly, but it's not obvious if the
+			 * alternative is better.
+			 */
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_DEAD_END:
+		case PM_NO_CHILDREN:
+			break;
+	}
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2836,6 +2870,9 @@ PostmasterStateMachine(void)
 	 * PM_WAIT_BACKENDS, but we signal the processes first, before waiting for
 	 * them.  Treating it as a distinct pmState allows us to share this code
 	 * across multiple shutdown code paths.
+	 *
+	 * Note that HandleFatalError() switches to PM_WAIT_BACKENDS even if we
+	 * were, before the fatal error, in a "more advanced" state.
 	 */
 	if (pmState == PM_STOP_BACKENDS || pmState == PM_WAIT_BACKENDS)
 	{
@@ -2967,13 +3004,12 @@ PostmasterStateMachine(void)
 					 * We don't consult send_abort_for_crash here, as it's
 					 * unlikely that dumping cores would illuminate the reason
 					 * for checkpointer fork failure.
+					 *
+					 * XXX: Is it worth inventing a different PMQUIT value
+					 * that signals that the cluster is in a bad state,
+					 * without a process having crashed?
 					 */
-					FatalError = true;
-					UpdatePMState(PM_WAIT_DEAD_END);
-					ConfigurePostmasterWaitSet(false);
-
-					/* Kill the walsenders and archiver too */
-					SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+					HandleFatalError(PMQUIT_FOR_CRASH, false);
 				}
 			}
 		}
-- 
2.45.2.746.g06e570c0df.dirty

v4-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patchtext/x-diff; charset=us-asciiDownload
From a42558e2b42fd7c029477b53904a3aef3cc0b7d1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:31:03 -0500
Subject: [PATCH v4 6/7] postmaster: Adjust which processes we expect to have
 exited

Comments and code stated that we expect checkpointer to have been signalled in
case of immediate shutdown / fatal errors, but didn't treat archiver and
walsenders the same. That doesn't seem right.

I had started digging through the history to see where this oddity was
introduced, but it's not the fault of a single commit.

Instead treat archiver, checkpointer, and walsenders the same.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6b3f047bd33..de234e7994f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2905,16 +2905,20 @@ PostmasterStateMachine(void)
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
-		 * expect the checkpointer to exit as well, otherwise not.
+		 * expect archiver, checkpointer and walsender to exit as well,
+		 * otherwise not.
 		 */
 		if (FatalError || Shutdown >= ImmediateShutdown)
-			targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+			targetMask = btmask_add(targetMask,
+									B_CHECKPOINTER,
+									B_ARCHIVER,
+									B_WAL_SENDER);
 
 		/*
-		 * Walsenders and archiver will continue running; they will be
-		 * terminated later after writing the checkpoint record.  We also let
-		 * dead-end children to keep running for now.  The syslogger process
-		 * exits last.
+		 * Normally walsenders and archiver will continue running; they will
+		 * be terminated later after writing the checkpoint record.  We also
+		 * let dead-end children to keep running for now.  The syslogger
+		 * process exits last.
 		 *
 		 * This assertion checks that we have covered all backend types,
 		 * either by including them in targetMask, or by noting here that they
@@ -2925,13 +2929,17 @@ PostmasterStateMachine(void)
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
 			remainMask = btmask_add(remainMask,
-									B_WAL_SENDER,
-									B_ARCHIVER,
 									B_DEAD_END_BACKEND,
 									B_LOGGER);
 
-			/* checkpointer may or may not be in targetMask already */
-			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+			/*
+			 * Archiver, checkpointer and walsender may or may not be in
+			 * targetMask already.
+			 */
+			remainMask = btmask_add(remainMask,
+									B_ARCHIVER,
+									B_CHECKPOINTER,
+									B_WAL_SENDER);
 
 			/* these are not real postmaster children */
 			remainMask = btmask_add(remainMask,
-- 
2.45.2.746.g06e570c0df.dirty

v4-0007-Change-shutdown-sequence-to-terminate-checkpointe.patchtext/x-diff; charset=us-asciiDownload
From d95b0f1b0d52c3f1db191f3d79e62e3489bf73ee Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:45:13 -0500
Subject: [PATCH v4 7/7] Change shutdown sequence to terminate checkpointer
 last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.

Another need for this change originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled (via
SIGINT) to trigger writing the shutdown checkpoint without terminating
it. Once checkpointer wrote the checkpoint it will wait for a termination
signal (SIGUSR2, as before).

Postmaster now triggers the shutdown checkpoint via SIGINT, where we
previously did so by terminating checkpointer. Checkpointer now is terminated
after all children other than dead-end ones have been terminated, tracked
using the new PM_WAIT_CHECKPOINTER PMState.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 125 +++++++++++----
 src/backend/postmaster/postmaster.c           | 143 +++++++++++++-----
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 200 insertions(+), 72 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index dd2c8376c6e..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0).  (All backends must be
- * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT.  After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
  *
  * If the checkpointer exits unexpectedly, the postmaster treats that the same
  * as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +145,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -159,6 +164,9 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
 
 /*
  * Main entry point for checkpointer process
@@ -188,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -327,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -346,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By separating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -586,29 +655,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -729,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -857,6 +904,20 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
+
 /* --------------------------------
  *		communication with backends
  * --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index de234e7994f..d38251dea8f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -334,6 +334,7 @@ typedef enum
 								 * ckpt */
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
@@ -2354,35 +2355,19 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know checkpointer wrote a shutdown
+				 * checkpoint, otherwise we'd still be in
+				 * PM_WAIT_XLOG_SHUTDOWN state.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point only dead-end children should be left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_WAIT_DEAD_END);
+				ConfigurePostmasterWaitSet(false);
+				SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 			}
 			else
 			{
@@ -2718,6 +2703,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 
 		case PM_WAIT_XLOG_SHUTDOWN:
 		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_CHECKPOINTER:
 			/*
 			 * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
 			 * the PM_WAIT_BACKENDS && FatalError code in
@@ -2979,9 +2965,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2995,10 +2981,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -3023,22 +3009,39 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+	 * PM_WAIT_XLOG_ARCHIVAL is in proccess_pm_pmsignal(), in response to
+	 * PMSIGNAL_XLOG_IS_SHUTDOWN.
+	 */
+
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			UpdatePMState(PM_WAIT_DEAD_END);
-			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone important is gone, tell checkpointer to shut
+			 * down too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+	 * in proccess_pm_child_exit().
+	 */
+
 	if (pmState == PM_WAIT_DEAD_END)
 	{
 		/*
@@ -3175,6 +3178,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3592,6 +3596,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3703,9 +3709,67 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/* Checkpointer completed the shutdown checkpoint */
+		if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+		{
+			/*
+			 * If we have an archiver subprocess, tell it to do a last archive
+			 * cycle and quit. Likewise, if we have walsender processes, tell them
+			 * to send any remaining WAL and quit.
+			 */
+			Assert(Shutdown > NoShutdown);
+
+			/* Waken archiver for the last time */
+			if (PgArchPMChild != NULL)
+				signal_child(PgArchPMChild, SIGUSR2);
+
+			/*
+			 * Waken walsenders for the last time. No regular backends should be
+			 * around anymore.
+			 */
+			SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+			UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+		}
+		else if (!FatalError && Shutdown != ImmediateShutdown)
+		{
+			/*
+			 * Checkpointer only ought to perform the shutdown checkpoint
+			 * during shutdown.  If somehow checkpointer did so in another
+			 * situation, we have no choice but to crash-restart.
+			 *
+			 * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+			 * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+			 * "interrupted" by a crash or an immediate shutdown.
+			 */
+			ereport(LOG,
+					(errmsg("WAL was shut down unexpectedly")));
+
+			/*
+			 * Doesn't seem likely to help to take send_abort_for_crash into
+			 * account here.
+			 */
+			HandleFatalError(PMQUIT_FOR_CRASH, false);
+		}
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3713,7 +3777,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -4024,6 +4088,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

#15Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#14)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:

Hi,

On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:

We have
multiple copies of code to go to FatalError, that doesn't seem great.

+  * FIXME: This should probably not have a copy fo the code to
+  * transition into FatalError state.
+  */
+  ereport(LOG,
+          (errmsg("WAL was shut down unexpectedly")));

Indeed, might be worth extracting this into a helper function or such.

That is quite a rats nest of issues :(. There are quite a few oddities around
the related code. The attached series contains a few commits trying to unify
the paths transitioning to FatalError = true into the new HandleFatalError().

Thanks!

The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?

After the earlier commit that just moves the code I turned the existing if ()
into a switch so that whoever adds new states is told to look at that code by
the compiler.

Good idea!

Thoughts?

======== Patch 0001:

LGTM

======== Patch 0002:

It makes use of TerminateChildren() in HandleChildCrash(). TerminateChildren()
follows the exact same logic as the code being replaced (exclude syslogger,
StartupPMChild special case), so LGTM.

======== Patch 0003:

It replaces the take_action boolean flag with an early return.

One comment:

=== 1

+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);

if (Shutdown != ImmediateShutdown)
FatalError = true;

I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).

======== Patch 0004:

One comment:

=== 2

As per comment === 1 above, then we could remove:

if (Shutdown != ImmediateShutdown)
FatalError = true;

from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior calling HandleFatalError()).

HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?

======== Patch 0005:

Comments:

=== 3

If so, then in this change:

-  FatalError = true;
-  UpdatePMState(PM_WAIT_DEAD_END);
-  ConfigurePostmasterWaitSet(false);
-
-  /* Kill the walsenders and archiver too */
-  SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+  HandleFatalError(PMQUIT_FOR_CRASH, false);

we should keep the "FatalError = true" part.

=== 4

+  * XXX: Is it worth inventing a different PMQUIT value
+  * that signals that the cluster is in a bad state,
+  * without a process having crashed?
   */

That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

Before doing so, what do you think about:

1. keeping FatalError = true; as suggeted above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call
3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?

That's not ideal but that way we would keep the "current" behavior (i.e failure
to fork checkpointer transitions through PM_WAIT_DEAD_END) during the time the
"switch directly to PM_DEAD_END" is not implemented?

But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

======== Patch 0006:

The patch adds archiver and walsender to processes that should exit during
crash/immediate shutdown.

LGTM.

======== Patch 0007:

Comments:

=== 5

commit message says "Change shutdown sequence to terminate checkpointer last",
which is not 100% accurate.

While this one is:

"
Checkpointer now is terminated
after all children other than dead-end ones have been terminated,
"

=== 6 (Nit)

+ * Close down the database.

s/database/database system/?

Was already there before the patch, just suggesting in passing...

=== 7

+ * PM_WAIT_XLOG_ARCHIVAL is in proccess_pm_pmsignal(), in response to

typo: s/proccess_pm_pmsignal/process_pm_pmsignal/

=== 8

+ * Now that everyone important is gone

s/everyone important is/walsenders and archiver are also gone/?

=== 9

+ * in proccess_pm_child_exit().

typo: s/proccess_pm_child_exit/process_pm_child_exit/

=== 10

+   /*
+    * Doesn't seem likely to help to take send_abort_for_crash into
+    * account here.
+    */
+    HandleFatalError(PMQUIT_FOR_CRASH, false);

If comments === 2 and === 3 made sense to you then we should set FatalError
to true here prior the HandleFatalError call.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#15)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:

On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:

The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?

There is, although I guess it could be reformulated a bit.

======== Patch 0003:

It replaces the take_action boolean flag with an early return.

One comment:

=== 1

+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);

if (Shutdown != ImmediateShutdown)
FatalError = true;

I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).

======== Patch 0004:

One comment:

=== 2

As per comment === 1 above, then we could remove:

if (Shutdown != ImmediateShutdown)
FatalError = true;

from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior calling HandleFatalError()).

HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?

I don't see what we'd gain from moving FatalError = true separate from
HandleFatalError? All it'll mean is more copied code?

=== 4

+  * XXX: Is it worth inventing a different PMQUIT value
+  * that signals that the cluster is in a bad state,
+  * without a process having crashed?
*/

That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.

For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
differ, the rest seems like it'd largely be duplicated:

case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;

That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

Before doing so, what do you think about:

1. keeping FatalError = true; as suggeted above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call

I don't like that, what's the point of having something like HandleFatalError
if we duplicate more code than it saves to each place?

I don't think we need to either, we can just do that in the relevant case
statement in HandleFatalError(). That works, I went back and forth several
times :)

3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?

That one would make sense to me.

But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

Why would it need to?

=== 8

+ * Now that everyone important is gone

s/everyone important is/walsenders and archiver are also gone/?

I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).

Greetings,

Andres Freund

#17Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#16)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:

Hi,

On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:

On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:

The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?

There is, although I guess it could be reformulated a bit.

Okay, thanks!

I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).

======== Patch 0004:

One comment:

=== 2

As per comment === 1 above, then we could remove:

if (Shutdown != ImmediateShutdown)
FatalError = true;

from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior calling HandleFatalError()).

HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?

I don't see what we'd gain from moving FatalError = true separate from
HandleFatalError? All it'll mean is more copied code?

The "Shutdown != ImmediateShutdown" test made me think that it might be cases
where we don't want to set FatalError to true in HandleFatalError(), so the
proposal. I can see that that's not the case (which makes fully sense) so I agree
that it's better to set FatalError to true unconditionally in HandleFatalError().

=== 4

+  * XXX: Is it worth inventing a different PMQUIT value
+  * that signals that the cluster is in a bad state,
+  * without a process having crashed?
*/

That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.

For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
differ, the rest seems like it'd largely be duplicated:

case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;

That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...

Fair enough. I'll look at the remaining pieces once this stuff goes in.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

Before doing so, what do you think about:

1. keeping FatalError = true; as suggeted above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call

I don't like that, what's the point of having something like HandleFatalError
if we duplicate more code than it saves to each place?

I don't think we need to either, we can just do that in the relevant case
statement in HandleFatalError(). That works, I went back and forth several
times :)

Okay ;-)

3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?

That one would make sense to me.

But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

Why would it need to?

Because I thought that TerminateChildren() needs to be called after
ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?

=== 8

+ * Now that everyone important is gone

s/everyone important is/walsenders and archiver are also gone/?

I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).

Good point, that's more "difficult" to maintain. Though I'm not sure that
"important" is the right wording. "Now that non dead processes have finished"
maybe? Could be seen as a Nit too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#18Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#17)
7 attachment(s)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

Attached is a new version of this patchset:

- HandleFatalError now doesn't expect to be called if already in FatalError or
ImmediateShutdown and contains a comment and assertions to that effect

- Fatal errors when in pmState > PM_WAIT_BACKENDS go to PM_WAIT_DEAD_END
directly. The previous version had them go to PM_WAIT_BACKENDS, to save
duplicating one line. The ConfigurePostmasterWaitSet(false) call is now
duplicated, with a comment at both sites mentioning that.

- Typo, comment and commit message fixes/polishing.

Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.

On 2025-01-15 08:38:33 +0000, Bertrand Drouvot wrote:

On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:

On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:

=== 4

+  * XXX: Is it worth inventing a different PMQUIT value
+  * that signals that the cluster is in a bad state,
+  * without a process having crashed?
*/

That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.

For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
differ, the rest seems like it'd largely be duplicated:

case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;

That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...

Fair enough. I'll look at the remaining pieces once this stuff goes in.

Cool. I did try writing the change, but it does indeed seem better as a
separate patch. Besides the error message, it also seems we ought to invent a
different ERRCODE, as neither ERRCODE_ADMIN_SHUTDOWN nor
ERRCODE_CRASH_SHUTDOWN seem appropriate.

Vaguely related and not at all crucial:

quickdie(), in the PMQUIT_FOR_CRASH, does
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));

but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...

But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

Why would it need to?

Because I thought that TerminateChildren() needs to be called after
ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?

I don't think that can happen with the attached patch - the
ConfigurePostmasterWaitSet() is called before HandleFatalError() returns, so
no new connections can be accepted, as that happens only from
ServerLoop()->AcceptConnection().

=== 8

+ * Now that everyone important is gone

s/everyone important is/walsenders and archiver are also gone/?

I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).

Good point, that's more "difficult" to maintain. Though I'm not sure that
"important" is the right wording. "Now that non dead processes have finished"
maybe? Could be seen as a Nit too.

This specific case I just solved by referencing "the processes mentioned
above"...

Greetings,

Andres Freund

Attachments:

v5-0001-checkpointer-Request-checkpoint-via-latch-instead.patchtext/x-diff; charset=us-asciiDownload
From a611c406af152049b550c45dc77a4678d6fc5514 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 24 Jan 2025 11:32:03 -0500
Subject: [PATCH v5 1/7] checkpointer: Request checkpoint via latch instead of
 signal

The motivation for this change is that a future commit will use SIGINT for
another purpose (postmaster requesting WAL access to be shut down) and that
there no other signals that we could readily use (see code comment for the
reason why SIGTERM shouldn't be used). But it's also a tad nicer / more
efficient to use SetLatch(), as it avoids sending signals when checkpointer
already is busy.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* --------------------------------
- *		signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* --------------------------------
  *		communication with backends
  * --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without the SetLatch().
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patchtext/x-diff; charset=us-asciiDownload
From 7c643fe209afa70804388888c6be19167f8795be Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:20:25 -0500
Subject: [PATCH v5 2/7] postmaster: Don't open-code TerminateChildren() in
 HandleChildCrash()

After removing the duplication no user of sigquit_child() remains, therefore
remove it.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 49 +++++++----------------------
 1 file changed, 11 insertions(+), 38 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5488289a489..ea92fb56c80 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -243,6 +243,13 @@ bool		enable_bonjour = false;
 char	   *bonjour_name;
 bool		restart_after_crash = true;
 bool		remove_temp_files_after_crash = true;
+
+/*
+ * When terminating child processes after fatal errors, like a crash of a
+ * child process, we normally send SIGQUIT -- and most other comments in this
+ * file are written on the assumption that we do -- but developers might
+ * prefer to use SIGABRT to collect per-child core dumps.
+ */
 bool		send_abort_for_crash = false;
 bool		send_abort_for_kill = false;
 
@@ -424,7 +431,6 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
@@ -2701,32 +2707,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	/*
 	 * Signal all other child processes to exit.  The crashed process has
 	 * already been removed from ActiveChildList.
+	 *
+	 * We could exclude dead-end children here, but at least when sending
+	 * SIGABRT it seems better to include them.
 	 */
 	if (take_action)
-	{
-		dlist_iter	iter;
-
-		dlist_foreach(iter, &ActiveChildList)
-		{
-			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
-
-			/* We do NOT restart the syslogger */
-			if (bp == SysLoggerPMChild)
-				continue;
-
-			if (bp == StartupPMChild)
-				StartupStatus = STARTUP_SIGNALED;
-
-			/*
-			 * This backend is still alive.  Unless we did so already, tell it
-			 * to commit hara-kiri.
-			 *
-			 * We could exclude dead-end children here, but at least when
-			 * sending SIGABRT it seems better to include them.
-			 */
-			sigquit_child(bp);
-		}
-	}
+		TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
 
 	if (Shutdown != ImmediateShutdown)
 		FatalError = true;
@@ -3349,19 +3335,6 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process.  We apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
 /*
  * Send a signal to the targeted children.
  */
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patchtext/x-diff; charset=us-asciiDownload
From 75c5eaa5c8a81885f8eba0b727e8cdeb85d84165 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 24 Jan 2025 11:40:50 -0500
Subject: [PATCH v5 3/7] postmaster: Don't repeatedly transition to crashing
 state

Previously HandleChildCrash() skipped logging and signalling child exits if
already in an immediate shutdown or FatalError, but still transitioned server
state in response to a crash. That's redundant.

In the other place we transition to FatalError, we do take care to not do so
when already in FatalError state.

To make it easier to combine different paths for entering FatalError state,
only do so once in HandleChildCrash().

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ea92fb56c80..a24c0385fe6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2685,8 +2685,6 @@ CleanupBackend(PMChild *bp,
 static void
 HandleChildCrash(int pid, int exitstatus, const char *procname)
 {
-	bool		take_action;
-
 	/*
 	 * We only log messages and send signals if this is the first process
 	 * crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2694,15 +2692,13 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
 
-	if (take_action)
-	{
-		LogChildExit(LOG, procname, pid, exitstatus);
-		ereport(LOG,
-				(errmsg("terminating any other active server processes")));
-		SetQuitSignalReason(PMQUIT_FOR_CRASH);
-	}
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
+	SetQuitSignalReason(PMQUIT_FOR_CRASH);
 
 	/*
 	 * Signal all other child processes to exit.  The crashed process has
@@ -2711,11 +2707,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * We could exclude dead-end children here, but at least when sending
 	 * SIGABRT it seems better to include them.
 	 */
-	if (take_action)
-		TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+	TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
 
-	if (Shutdown != ImmediateShutdown)
-		FatalError = true;
+	FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
 	if (pmState == PM_RECOVERY ||
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0004-postmaster-Move-code-to-switch-into-FatalError-st.patchtext/x-diff; charset=us-asciiDownload
From 8b1ffdd4b4b14aced5d78c854202737f9d5b57ed Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:57:12 -0500
Subject: [PATCH v5 4/7] postmaster: Move code to switch into FatalError state
 into function

There are two places switching to FatalError mode, behaving somewhat
differently. An upcoming commit will introduce a third. That doesn't seem seem
like a good idea.

This commit just moves the FatalError related code from HandleChildCrash()
into its own function, a subsequent commit will evolve the state machine
change to be suitable for other callers.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 74 ++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 23 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a24c0385fe6..a92fe4240b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2674,40 +2674,35 @@ CleanupBackend(PMChild *bp,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, archiver, slot sync worker, or background worker.
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
  *
- * The objectives here are to clean up our local state about the child
- * process, and to signal all other remaining children to quickdie.
- *
- * The caller has already released its PMChild slot.
+ * This should only be called when not already in FatalError or
+ * ImmediateShutdown state.
  */
 static void
-HandleChildCrash(int pid, int exitstatus, const char *procname)
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 {
-	/*
-	 * We only log messages and send signals if this is the first process
-	 * crash and we're not doing an immediate shutdown; otherwise, we're only
-	 * here to update postmaster's idea of live processes.  If we have already
-	 * signaled children, nonzero exit status is to be expected, so don't
-	 * clutter log.
-	 */
-	if (FatalError || Shutdown == ImmediateShutdown)
-		return;
+	int			sigtosend;
+
+	Assert(!FatalError);
+	Assert(Shutdown != ImmediateShutdown);
+
+	SetQuitSignalReason(reason);
 
-	LogChildExit(LOG, procname, pid, exitstatus);
-	ereport(LOG,
-			(errmsg("terminating any other active server processes")));
-	SetQuitSignalReason(PMQUIT_FOR_CRASH);
+	if (consider_sigabrt && send_abort_for_crash)
+		sigtosend = SIGABRT;
+	else
+		sigtosend = SIGQUIT;
 
 	/*
-	 * Signal all other child processes to exit.  The crashed process has
-	 * already been removed from ActiveChildList.
+	 * Signal all other child processes to exit.
 	 *
 	 * We could exclude dead-end children here, but at least when sending
 	 * SIGABRT it seems better to include them.
 	 */
-	TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+	TerminateChildren(sigtosend);
 
 	FatalError = true;
 
@@ -2727,6 +2722,39 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		AbortStartTime = time(NULL);
 }
 
+/*
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter, autovacuum, archiver, slot sync worker, or background worker.
+ *
+ * The objectives here are to clean up our local state about the child
+ * process, and to signal all other remaining children to quickdie.
+ *
+ * The caller has already released its PMChild slot.
+ */
+static void
+HandleChildCrash(int pid, int exitstatus, const char *procname)
+{
+	/*
+	 * We only log messages and send signals if this is the first process
+	 * crash and we're not doing an immediate shutdown; otherwise, we're only
+	 * here to update postmaster's idea of live processes.  If we have already
+	 * signaled children, nonzero exit status is to be expected, so don't
+	 * clutter log.
+	 */
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
+
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
+
+	/*
+	 * Switch into error state. The crashed process has already been removed
+	 * from ActiveChildList.
+	 */
+	HandleFatalError(PMQUIT_FOR_CRASH, true);
+}
+
 /*
  * Log the death of a child process.
  */
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0005-postmaster-Commonalize-FatalError-paths.patchtext/x-diff; charset=us-asciiDownload
From 627fcf50e7e03e12dc1c61ec76c32fb48532796a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:25:01 -0500
Subject: [PATCH v5 5/7] postmaster: Commonalize FatalError paths

This includes some behavioral changes:

- Previously PM_WAIT_XLOG_ARCHIVAL wasn't handled in HandleFatalError(), that
  doesn't seem quite right.

- Previously a fatal error in PM_WAIT_XLOG_SHUTDOWN lead to jumping back to
  PM_WAIT_BACKENDS, no we go to PM_WAIT_DEAD_END. Jumping backwards doesn't
  seem quite right and we didn't do so when checkpointer failed to fork during
  a shutdown.

- Previously a checkpointer fork failure didn't call SetQuitSignalReason(),
  which would lead to quickdie() reporting
  "terminating connection because of unexpected SIGQUIT signal"
  which seems even worse than the PMQUIT_FOR_CRASH message. If I saw that in
  the log I'd suspect somebody outside of postgres sent SIGQUITs

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 73 ++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a92fe4240b2..95eb989355c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2706,13 +2706,49 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 
 	FatalError = true;
 
-	/* We now transit into a state of waiting for children to die */
-	if (pmState == PM_RECOVERY ||
-		pmState == PM_HOT_STANDBY ||
-		pmState == PM_RUN ||
-		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_WAIT_XLOG_SHUTDOWN)
-		UpdatePMState(PM_WAIT_BACKENDS);
+	/*
+	 * Choose the appropriate new state to react to the fatal error. Unless we
+	 * were already in the process of shutting down, we go through
+	 * PM_WAIT_BACKEND. For errors during the shutdown sequence, we directly
+	 * switch to PM_WAIT_DEAD_END.
+	 */
+	switch (pmState)
+	{
+		case PM_INIT:
+			/* shouldn't have any children */
+			Assert(false);
+			break;
+		case PM_STARTUP:
+			/* should have been handled in process_pm_child_exit */
+			Assert(false);
+			break;
+
+			/* wait for children to die */
+		case PM_RECOVERY:
+		case PM_HOT_STANDBY:
+		case PM_RUN:
+		case PM_STOP_BACKENDS:
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_BACKENDS:
+			/* there might be more backends to wait for */
+			break;
+
+		case PM_WAIT_XLOG_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+			/*
+			 * NB: Similar code exists in PostmasterStateMachine()'s handling
+			 * of FatalError in PM_STOP_BACKENDS/PM_WAIT_BACKENDS states.
+			 */
+			ConfigurePostmasterWaitSet(false);
+			UpdatePMState(PM_WAIT_DEAD_END);
+			break;
+
+		case PM_WAIT_DEAD_END:
+		case PM_NO_CHILDREN:
+			break;
+	}
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2942,15 +2978,18 @@ PostmasterStateMachine(void)
 			{
 				/*
 				 * Stop any dead-end children and stop creating new ones.
+				 *
+				 * NB: Similar code exists in HandleFatalErrors(), when the
+				 * error happens in pmState > PM_WAIT_BACKENDS.
 				 */
 				UpdatePMState(PM_WAIT_DEAD_END);
 				ConfigurePostmasterWaitSet(false);
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd auxiliary processes (other than
+				 * logger), if any, when we started immediate shutdown or
+				 * entered FatalError state.
 				 */
 			}
 			else
@@ -2981,13 +3020,15 @@ PostmasterStateMachine(void)
 					 * We don't consult send_abort_for_crash here, as it's
 					 * unlikely that dumping cores would illuminate the reason
 					 * for checkpointer fork failure.
+					 *
+					 * XXX: It may be worth to introduce a different PMQUIT
+					 * value that signals that the cluster is in a bad state,
+					 * without a process having crashed. But right now this
+					 * path is very unlikely to be reached, so it isn't
+					 * obviously worthwhile adding a distinct error message in
+					 * quickdie().
 					 */
-					FatalError = true;
-					UpdatePMState(PM_WAIT_DEAD_END);
-					ConfigurePostmasterWaitSet(false);
-
-					/* Kill the walsenders and archiver too */
-					SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+					HandleFatalError(PMQUIT_FOR_CRASH, false);
 				}
 			}
 		}
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patchtext/x-diff; charset=us-asciiDownload
From 810ac95b2ddcaea3b78ab1809cdcc8c522a27dc6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:31:03 -0500
Subject: [PATCH v5 6/7] postmaster: Adjust which processes we expect to have
 exited

Comments and code stated that we expect checkpointer to have been signalled in
case of immediate shutdown / fatal errors, but didn't treat archiver and
walsenders the same. That doesn't seem right.

I had started digging through the history to see where this oddity was
introduced, but it's not the fault of a single commit.

Instead treat archiver, checkpointer, and walsenders the same.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 95eb989355c..37bb0ea4d7a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2918,16 +2918,20 @@ PostmasterStateMachine(void)
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
-		 * expect the checkpointer to exit as well, otherwise not.
+		 * expect archiver, checkpointer and walsender to exit as well,
+		 * otherwise not.
 		 */
 		if (FatalError || Shutdown >= ImmediateShutdown)
-			targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+			targetMask = btmask_add(targetMask,
+									B_CHECKPOINTER,
+									B_ARCHIVER,
+									B_WAL_SENDER);
 
 		/*
-		 * Walsenders and archiver will continue running; they will be
-		 * terminated later after writing the checkpoint record.  We also let
-		 * dead-end children to keep running for now.  The syslogger process
-		 * exits last.
+		 * Normally walsenders and archiver will continue running; they will
+		 * be terminated later after writing the checkpoint record.  We also
+		 * let dead-end children to keep running for now.  The syslogger
+		 * process exits last.
 		 *
 		 * This assertion checks that we have covered all backend types,
 		 * either by including them in targetMask, or by noting here that they
@@ -2938,13 +2942,17 @@ PostmasterStateMachine(void)
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
 			remainMask = btmask_add(remainMask,
-									B_WAL_SENDER,
-									B_ARCHIVER,
 									B_DEAD_END_BACKEND,
 									B_LOGGER);
 
-			/* checkpointer may or may not be in targetMask already */
-			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+			/*
+			 * Archiver, checkpointer and walsender may or may not be in
+			 * targetMask already.
+			 */
+			remainMask = btmask_add(remainMask,
+									B_ARCHIVER,
+									B_CHECKPOINTER,
+									B_WAL_SENDER);
 
 			/* these are not real postmaster children */
 			remainMask = btmask_add(remainMask,
-- 
2.48.1.76.g4e746b1a31.dirty

v5-0007-Change-shutdown-sequence-to-terminate-checkpointe.patchtext/x-diff; charset=us-asciiDownload
From 3877e1dffa991254082aca07afe4e6d69e1b6ddc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 24 Jan 2025 12:25:14 -0500
Subject: [PATCH v5 7/7] Change shutdown sequence to terminate checkpointer
 last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.

The only reason the current shutdown sequence does not actively cause
problems, is that walsender currently does not generate any stats. However,
there is a patch to change that.

Another need for this change originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled (via
SIGINT) to trigger writing the shutdown checkpoint without terminating
it. Once checkpointer wrote the checkpoint it will wait for a termination
signal (SIGUSR2, as before).

Postmaster now triggers the shutdown checkpoint via SIGINT, where we
previously did so by terminating checkpointer. Checkpointer now is terminated
after all children other than dead-end children and logger have been
terminated, tracked using the new PM_WAIT_CHECKPOINTER PMState.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 127 ++++++++++++----
 src/backend/postmaster/postmaster.c           | 140 +++++++++++++-----
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 202 insertions(+), 69 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index dd2c8376c6e..b94f9cdff21 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0).  (All backends must be
- * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT.  After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
  *
  * If the checkpointer exits unexpectedly, the postmaster treats that the same
  * as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +145,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -159,6 +164,9 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
 
 /*
  * Main entry point for checkpointer process
@@ -188,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -327,7 +338,8 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write the shutdown checkpoint or
+	 * terminate.
 	 */
 	for (;;)
 	{
@@ -346,7 +358,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +532,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +577,57 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+		ShutdownXLOGPending = false;
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By separating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -586,29 +657,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -729,6 +777,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -857,6 +906,20 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
+
 /* --------------------------------
  *		communication with backends
  * --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 37bb0ea4d7a..bb22b13adef 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -341,6 +341,7 @@ typedef enum
 								 * ckpt */
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
@@ -2363,35 +2364,20 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know checkpointer wrote a shutdown
+				 * checkpoint, otherwise we'd still be in
+				 * PM_WAIT_XLOG_SHUTDOWN state.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point only dead-end children and logger should be
+				 * left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_WAIT_DEAD_END);
+				ConfigurePostmasterWaitSet(false);
+				SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 			}
 			else
 			{
@@ -2737,6 +2723,8 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 
 		case PM_WAIT_XLOG_SHUTDOWN:
 		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_CHECKPOINTER:
+
 			/*
 			 * NB: Similar code exists in PostmasterStateMachine()'s handling
 			 * of FatalError in PM_STOP_BACKENDS/PM_WAIT_BACKENDS states.
@@ -3011,10 +2999,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -3042,22 +3030,40 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+	 * PM_WAIT_XLOG_ARCHIVAL is in process_pm_pmsignal(), in response to
+	 * PMSIGNAL_XLOG_IS_SHUTDOWN.
+	 */
+
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there are no children other
+		 * than checkpointer, dead-end children and logger left. There
+		 * shouldn't be any regular backends left by now anyway; what we're
+		 * really waiting for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			UpdatePMState(PM_WAIT_DEAD_END);
-			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that the processes mentioned above are gone, tell
+			 * checkpointer to shut down too. That allows checkpointer to
+			 * perform some last bits of cleanup without other processes
+			 * interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+	 * in process_pm_child_exit().
+	 */
+
 	if (pmState == PM_WAIT_DEAD_END)
 	{
 		/*
@@ -3194,6 +3200,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3612,6 +3619,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3723,9 +3732,67 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/* Checkpointer completed the shutdown checkpoint */
+		if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+		{
+			/*
+			 * If we have an archiver subprocess, tell it to do a last archive
+			 * cycle and quit. Likewise, if we have walsender processes, tell
+			 * them to send any remaining WAL and quit.
+			 */
+			Assert(Shutdown > NoShutdown);
+
+			/* Waken archiver for the last time */
+			if (PgArchPMChild != NULL)
+				signal_child(PgArchPMChild, SIGUSR2);
+
+			/*
+			 * Waken walsenders for the last time. No regular backends should
+			 * be around anymore.
+			 */
+			SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+			UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+		}
+		else if (!FatalError && Shutdown != ImmediateShutdown)
+		{
+			/*
+			 * Checkpointer only ought to perform the shutdown checkpoint
+			 * during shutdown.  If somehow checkpointer did so in another
+			 * situation, we have no choice but to crash-restart.
+			 *
+			 * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+			 * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+			 * "interrupted" by a crash or an immediate shutdown.
+			 */
+			ereport(LOG,
+					(errmsg("WAL was shut down unexpectedly")));
+
+			/*
+			 * Doesn't seem likely to help to take send_abort_for_crash into
+			 * account here.
+			 */
+			HandleFatalError(PMQUIT_FOR_CRASH, false);
+		}
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3733,7 +3800,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -4044,6 +4111,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.48.1.76.g4e746b1a31.dirty

#19Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#18)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On 2025-01-24 15:06:17 -0500, Andres Freund wrote:

Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.

I did push all but the last yesterday and, not seeing any related issues in
the BF, pushed the last one just now. Let's hope the BF stays green(ish)...

Greetings,

Andres

#20Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#19)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Sat, Jan 25, 2025 at 11:44:54AM -0500, Andres Freund wrote:

Hi,

On 2025-01-24 15:06:17 -0500, Andres Freund wrote:

Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.

I did push all but the last yesterday and, not seeing any related issues in
the BF, pushed the last one just now. Let's hope the BF stays green(ish)...

Just did a quick check and that still looks green(ish) (and when it's not,
it does not seem related to this change).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#18)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Fri, Jan 24, 2025 at 03:06:17PM -0500, Andres Freund wrote:

On 2025-01-15 08:38:33 +0000, Bertrand Drouvot wrote:

Fair enough. I'll look at the remaining pieces once this stuff goes in.

Cool. I did try writing the change, but it does indeed seem better as a
separate patch. Besides the error message, it also seems we ought to invent a
different ERRCODE, as neither ERRCODE_ADMIN_SHUTDOWN nor
ERRCODE_CRASH_SHUTDOWN seem appropriate.

Thanks for the feedback, will look at it.

Vaguely related and not at all crucial:

quickdie(), in the PMQUIT_FOR_CRASH, does
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));

but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...

Yeah, it does not make the distinction between an "immediate restart" and an
"immediate stop". I agree that such hint "should" be added in case of "immediate
restart" (and keep "immediate stop" as it is): will give it a look.

But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

Why would it need to?

Because I thought that TerminateChildren() needs to be called after
ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?

I don't think that can happen with the attached patch - the
ConfigurePostmasterWaitSet() is called before HandleFatalError() returns, so
no new connections can be accepted, as that happens only from
ServerLoop()->AcceptConnection().

Thanks for the explanation, that was the missing part in my reasoning. So yeah,
all good as no new connections are accepted.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#22Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#21)
Re: Reorder shutdown sequence, to flush pgstats later

Hi,

On Mon, Jan 27, 2025 at 03:07:01PM +0000, Bertrand Drouvot wrote:

On Fri, Jan 24, 2025 at 03:06:17PM -0500, Andres Freund wrote:

but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...

Yeah, it does not make the distinction between an "immediate restart" and an
"immediate stop". I agree that such hint "should" be added in case of "immediate
restart" (and keep "immediate stop" as it is): will give it a look.

I looked at it and learned that an immediate restart is nothing but a stop
followed by a start (do_restart()): "pg_ctl" sends a SIGQUIT to postmaster to
indicate an immediate stop and later asks for a start (do_start()).

So when the postmaster receives the SIGQUIT it has no clue that a start will
follow (as you said up-thread).

I’m not familiar with this area of the code, and just discovered it behaves that
way.

To give quickdie() more context I can see 3 options:

1. Use another signal for an immediate restart (but looking at PostmasterMain()
I don’t see any "good" remaining signal to use)
2. Use a shared memory flag to indicate an immediate restart (but this kind of
"communication" looks to be new between "pg_ctl" and postmaster and might be
over complicating things)
3. Use a new DB_SHUTDOWNING_FOR_RESTART or such in DBState and let pg_ctl modify
the control file? (That way we could give more context to quickdie())

It looks to me that 3. makes the more sense (but as said I'm not familiar with
this code so maybe there is another option one could think of or my proposal
is not appropriate), thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com