processes stuck in shutdown following OOM/recovery

Started by Justin Pryzby, about 2 years ago, 11 messages
#1Justin Pryzby
pryzby@telsasoft.com

If postgres starts, and one of its children is immediately killed, and
the cluster is also told to stop, then instead of shutting down, the
whole system gets wedged.

$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
.......................................................... failed
pg_ctl: server does not shut down

$ ps -wwwf --ppid 2524494
UID PID PPID C STIME TTY TIME CMD
pryzbyj 2524552 2524494 0 20:47 ? 00:00:00 postgres: checkpointer

(gdb) bt
#0 0x00007f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, maxevents=1, timeout=timeout@entry=156481) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, cur_timeout=cur_timeout@entry=156481, occurred_events=occurred_events@entry=0x7ffd80130410,
nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2 0x000055cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, timeout=timeout@entry=300000, occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:1529
#3 0x000055cb4c219f87 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=300000, wait_event_info=wait_event_info@entry=83886084)
at ../src/backend/storage/ipc/latch.c:539
#4 0x000055cb4c1aabc2 in CheckpointerMain () at ../src/backend/postmaster/checkpointer.c:523
#5 0x000055cb4c1a8207 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at ../src/backend/postmaster/auxprocess.c:153
#6 0x000055cb4c1ae63d in StartChildProcess (type=type@entry=CheckpointerProcess) at ../src/backend/postmaster/postmaster.c:5331
#7 0x000055cb4c1b07f3 in ServerLoop () at ../src/backend/postmaster/postmaster.c:1792
#8 0x000055cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55cb4cbaa380) at ../src/backend/postmaster/postmaster.c:1466
#9 0x000055cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at ../src/backend/main/main.c:198

I noticed this because of the counterproductive behavior of the
systemd+PGDG unit files, which run "pg_ctl stop" whenever a backend is
killed for OOM:
/messages/by-id/ZVI112aVNCHOQgfF@pryzbyj2023

This affects v15, and fails at 7ff23c6d27 but not its parent.

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro <tmunro@postgresql.org>
Date: Mon Aug 2 17:32:20 2021 +1200

Run checkpointer and bgwriter in crash recovery.

--
Justin

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#1)
Re: processes stuck in shutdown following OOM/recovery

On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid

This affects v15, and fails at 7ff23c6d27 but not its parent.

Repro'd here. I had to make the sleep shorter on my system. Looking...

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
Re: processes stuck in shutdown following OOM/recovery

On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid

This affects v15, and fails at 7ff23c6d27 but not its parent.

Repro'd here. I had to make the sleep shorter on my system. Looking...

The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
the checkpointer to shut down in this race case. We have
CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
before), and FatalError is true because a child recently crashed and
we haven't yet received the PMSIGNAL_RECOVERY_STARTED signal whose
handler would clear it. Hmm.
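
To spell that out (paraphrasing the code; the exact set of PIDs checked
varies by branch), the PM_WAIT_BACKENDS arm looks roughly like this:

    if (/* all regular backends and most aux processes have exited */ &&
        (CheckpointerPID == 0 ||
         (!FatalError && Shutdown < ImmediateShutdown)) &&
        ...)
    {
        if (Shutdown >= ImmediateShutdown || FatalError)
        {
            /* move to PM_WAIT_DEAD_END; assumes the checkpointer already
             * got SIGQUIT and has exited */
        }
        else
        {
            /* normal smart/fast shutdown: tell the checkpointer to run
             * the shutdown checkpoint, then move to PM_SHUTDOWN */
        }
    }

With CheckpointerPID != 0 and FatalError still true, the outer condition
never becomes true, so neither branch runs and nothing ever tells the
checkpointer to exit.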

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
1 attachment(s)
Re: processes stuck in shutdown following OOM/recovery

On Sat, Dec 2, 2023 at 3:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid

This affects v15, and fails at 7ff23c6d27 but not its parent.

Repro'd here. I had to make the sleep shorter on my system. Looking...

The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
the checkpointer to shut down in this race case. We have
CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
before), and FatalError is true because a child recently crashed and
we haven't yet received the PMSIGNAL_RECOVERY_STARTED signal whose
handler would clear it. Hmm.

Here is a first attempt at fixing this. I am not yet 100% sure if it
is right, and there may be a nicer/simpler way to express the
conditions. It passes the test suite, and it fixes the repro that
Justin posted. FYI, on my machine I had to use sleep 0.005 where he
had 0.05, in case someone else is trying to reproduce the issue.

Attachments:

0001-Fix-rare-recovery-shutdown-hang-due-to-checkpointer.patch (application/octet-stream)
From b6e7bff6a4f196303356dfb478604a58b077147c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 5 Mar 2024 20:33:14 +1300
Subject: [PATCH] Fix rare recovery shutdown hang due to checkpointer.

Commit 7ff23c6d started running the checkpointer during crash recovery.

As discovered by Justin, in one rare case it could prevent shutdown from
succeeding during a narrow phase at the beginning of crash recovery
after a server crash.

When the server is automatically restarting but before
PMSIGNAL_RECOVERY_STARTED is received from the startup process,
FatalError is still true.  If a shutdown request arrived in that narrow
window, the PostmasterStateMachine() logic behaved as if the
checkpointer was not running and didn't need to be told to shut down,
and yet waited forever for it to exit.

Now, we can only move from the PM_WAIT_BACKENDS state directly to
PM_WAIT_DEAD_END if the checkpointer isn't running.  If it is, we now
distinguish between the smart and fast shutdown case, where we need to
tell the checkpointer to shut down and move to PM_SHUTDOWN, and the
immediate shutdown or child crash case, where it should already have
been told to quit and we're still waiting for that to happen, so we stay
in PM_WAIT_BACKENDS.

Back-patch to 15.

XXX Experimental patch, not sure yet

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/ZWlrdQarrZvLsgIk@pryzbyj2023
---
 src/backend/postmaster/postmaster.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index da0c627107e..62db752228a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3748,12 +3748,14 @@ PostmasterStateMachine(void)
 			WalSummarizerPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && Shutdown < ImmediateShutdown) ||
+			 (FatalError && CheckpointerPID != 0)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0 &&
 			SlotSyncWorkerPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (CheckpointerPID == 0 &&
+				(Shutdown >= ImmediateShutdown || FatalError))
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3767,7 +3769,7 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
-			else
+			else if (Shutdown > NoShutdown && Shutdown < ImmediateShutdown)
 			{
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
@@ -3805,6 +3807,16 @@ PostmasterStateMachine(void)
 						signal_child(PgArchPID, SIGQUIT);
 				}
 			}
+			else
+			{
+				/*
+				 * Either it's an immediate shutdown or a child crashed, and
+				 * we're still waiting for all the children to quit.  The
+				 * checkpointer was already told to quit.
+				 */
+				Assert(Shutdown == ImmediateShutdown ||
+					   (Shutdown == NoShutdown && FatalError));
+			}
 		}
 	}
 
-- 
2.43.0

#5Martijn Wallet
martijn@dbre.nu
In reply to: Thomas Munro (#4)
Re: processes stuck in shutdown following OOM/recovery

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Hi, I somehow can't manage to mark all the checkboxes on this review page...
However, I built and tested the patch, and everything passed successfully on Rocky Linux release 8.9 (Green Obsidian).
I'm not sure whether more reviewing is needed on other operating systems, since this is only my second review.

Cheers, Martijn.

The new status of this patch is: Ready for Committer

#6Martijn Wallet
martijn@dbre.nu
In reply to: Thomas Munro (#4)
Re: processes stuck in shutdown following OOM/recovery

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Hi, I somehow can't manage to mark all the checkboxes on this review page...
However, I built and tested the patch, and everything passed successfully on Rocky Linux release 8.9 (Green Obsidian).
I'm not sure whether more reviewing is needed on other operating systems, since this is only my second review.

Cheers, Martijn.

NB: second mail, to check that SPF is fixed and that Thomas receives this message.

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Martijn Wallet (#6)
Re: processes stuck in shutdown following OOM/recovery

On Thu, May 23, 2024 at 9:58 AM Martijn Wallet <martijn@dbre.nu> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Hi, I somehow can't manage to mark all the checkboxes on this review page...
However, I built and tested the patch, and everything passed successfully on Rocky Linux release 8.9 (Green Obsidian).
I'm not sure whether more reviewing is needed on other operating systems, since this is only my second review.

Thanks!

I'm also hoping to get review of the rather finickity state machine
logic involved from people familiar with that; I think it's right, but
I'd hate to break some other edge case...

NB: second mail, to check that SPF is fixed and that Thomas receives this message.

FTR 171641337152.1103.7326466732639994038.pgcf@coridan.postgresql.org
and 171641505305.1105.9868637944637520353.pgcf@coridan.postgresql.org
both showed up in my inbox, and they both have headers "Received-SPF:
pass ...".

#8Martijn Wallet
martijn@dbre.nu
In reply to: Thomas Munro (#7)
Re: processes stuck in shutdown following OOM/recovery

Great, thanks for the feedback. It was probably the DKIM.

#9Martijn Wallet
martijn@dbre.nu
In reply to: Martijn Wallet (#8)
Re: processes stuck in shutdown following OOM/recovery

Last test to get verified mail; I added lists.postgresql.org to the SPF record. Cheers.

#10Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#7)
Re: processes stuck in shutdown following OOM/recovery

At commit 779972e, I got about 50% "pg_ctl: server does not shut down" from
$SUBJECT with this loop:

nti=; while date; do PGCTLTIMEOUT=4 make check-tests TESTS=reindex_catalog PG_TEST_INITDB_EXTRA_OPTS='-cwal_level=minimal -cmax_wal_senders=0' NO_TEMP_INSTALL=$nti; nti=1; grep abnormal src/test/regress/log/postmaster.log; done

Your patch fixes that. (It came to my attention while developing commit
67bab53. At that commit or later, this recipe won't see the problem.) The
patch is a strict improvement, so I would be marking
https://commitfest.postgresql.org/50/4884/ Ready for Committer if it weren't
already in that state. That said, ...

On Thu, May 23, 2024 at 10:29:13AM +1200, Thomas Munro wrote:

I'm also hoping to get review of the rather finickity state machine
logic involved from people familiar with that; I think it's right, but
I'd hate to break some other edge case...

... while I don't see an edge case it makes worse, I would fix the defect
differently, to cover more existing edge cases. I think a good standard is to
align with restart_after_crash=off behaviors. With restart_after_crash=off,
the postmaster exits with "shutting down because \"restart_after_crash\" is
off". Assume an external service manager then restarts it. Any deviations
between that experience and running with restart_after_crash=on should have a
specific reason. So post-OOM recovery should mirror original crash recovery
of a fresh postmaster.

List of behaviors the postmaster FatalError variable controls:

arms "issuing %s to recalcitrant children"
disarms "terminating any other active server processes"
changes CAC_STARTUP to CAC_RECOVERY
makes PM_WAIT_BACKENDS assume checkpointer already got SIGQUIT [causes $SUBJECT hang]
exit(1), as opposed to exit(0)
LOG "abnormal database system shutdown"
"all server processes terminated; reinitializing"
disarm maybe_start_bgworkers

Once we launch a new startup process for post-OOM recovery, I think we want
all of those behaviors to cease. In particular, disarming
maybe_start_bgworkers() and disarming "terminating any other active server
processes" are bad at that point. (And the $SUBJECT hang is bad, of course.)
What do you think? The rest are more cosmetic. (We don't document the exit
status, and pg_ctl ignores it. CAC_STARTUP gives me the most ambivalence, but
its cosmetics don't matter too much.) Disabling all those entails essentially
the following, though it would deserve comment edits and removal of the later
FatalError=false:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3144,8 +3144,9 @@ PostmasterStateMachine(void)
 		StartupPID = StartChildProcess(B_STARTUP);
 		Assert(StartupPID != 0);
 		StartupStatus = STARTUP_RUNNING;
 		SetPmState(PM_STARTUP);
+		FatalError = false;
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;

Your patch eliminates all hangs for me, but about half the iterations of the
above test command still get exit(1) and the "abnormal" message. That could
be fine. Perhaps that matches what would happen if the fast shutdown arrived
earlier, after the postmaster observed the crash and before the last backend
reacted to SIGQUIT. Still, I'd lean toward clearing FatalError once we launch
the startup process. Again, two other behaviors it controls feel bad at that
point, and none of the behaviors it controls feel clearly-good at that point.

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3748,12 +3748,14 @@ PostmasterStateMachine(void)
WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && Shutdown < ImmediateShutdown) ||
+			 (FatalError && CheckpointerPID != 0)) &&

I think that's equivalent to:

(CheckpointerPID == 0 || FatalError || Shutdown < ImmediateShutdown)
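
(A quick brute-force check over stand-in booleans for the three inputs
backs that up; this standalone program prints zero mismatches:)

    #include <stdio.h>

    int
    main(void)
    {
        int mismatches = 0;

        for (int ckpt_running = 0; ckpt_running <= 1; ckpt_running++)
            for (int fatal = 0; fatal <= 1; fatal++)
                for (int below_immediate = 0; below_immediate <= 1; below_immediate++)
                {
                    /* patched: CheckpointerPID == 0 ||
                     *          (!FatalError && Shutdown < ImmediateShutdown) ||
                     *          (FatalError && CheckpointerPID != 0) */
                    int patched = !ckpt_running ||
                        (!fatal && below_immediate) ||
                        (fatal && ckpt_running);
                    /* simplified form */
                    int simplified = !ckpt_running || fatal || below_immediate;

                    if (patched != simplified)
                    {
                        printf("differ: ckpt_running=%d fatal=%d below_immediate=%d\n",
                               ckpt_running, fatal, below_immediate);
                        mismatches++;
                    }
                }

        printf("%d mismatches across 8 cases\n", mismatches);
        return 0;
    }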

Separable topic: this related comment and code are already out of date:

    /*
     * Unexpected exit of startup process (including FATAL exit)
     * during PM_STARTUP is treated as catastrophic. There are no
     * other processes running yet, so we can just exit.
     */
    if (pmState == PM_STARTUP &&
        StartupStatus != STARTUP_SIGNALED &&
        !EXIT_STATUS_0(exitstatus))
    {
        LogChildExit(LOG, _("startup process"),
                     pid, exitstatus);
        ereport(LOG,
                (errmsg("aborting startup due to startup process failure")));
        ExitPostmaster(1);
    }

There are other children, like the checkpointer. Our normal practice would be
to SIGQUIT them and wait. (These days, the children notice postmaster exit
quickly, so it doesn't come up much).
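
A rough, untested sketch of that direction, reusing the existing crash
path instead of exiting (assuming HandleChildCrash() keeps roughly its
current shape):

    /* hypothetical: treat startup-process failure during PM_STARTUP like
     * any other child crash, so the remaining children (checkpointer,
     * bgwriter) get SIGQUIT and we wait for them in PM_WAIT_BACKENDS,
     * rather than calling ExitPostmaster(1) while they are still running */
    if (pmState == PM_STARTUP &&
        StartupStatus != STARTUP_SIGNALED &&
        !EXIT_STATUS_0(exitstatus))
    {
        HandleChildCrash(pid, exitstatus, _("startup process"));
    }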

Thanks,
nm

#11Kirill Reshke
reshkekirill@gmail.com
In reply to: Thomas Munro (#4)
Re: processes stuck in shutdown following OOM/recovery

On Wed, 6 Mar 2024 at 02:22, Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Dec 2, 2023 at 3:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid

This affects v15, and fails at 7ff23c6d27 but not its parent.

Repro'd here. I had to make the sleep shorter on my system. Looking...

The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
the checkpointer to shut down in this race case. We have
CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
before), and FatalError is true because a child recently crashed and
we haven't yet received the PMSIGNAL_RECOVERY_STARTED signal whose
handler would clear it. Hmm.

Here is a first attempt at fixing this. I am not yet 100% sure if it
is right, and there may be a nicer/simpler way to express the
conditions. It passes the test suite, and it fixes the repro that
Justin posted. FYI, on my machine I had to use sleep 0.005 where he
had 0.05, in case someone else is trying to reproduce the issue.

Hi!
This patch needs a rebase. The CF entry status is now Waiting on Author.

--
Best regards,
Kirill Reshke