[PATCH] postmaster: fix stale PM_STARTUP comment

Started by Ayush Tiwariabout 1 month ago8 messageshackers
Jump to latest
#1Ayush Tiwari
ayushtiwari.slg01@gmail.com

Hi,

The comment above the PM_STARTUP startup-process-failure case still says
that there are no other processes running yet, so the postmaster can just
exit.

That no longer matches the current startup flow: PM_STARTUP may already
have auxiliary processes running by that point. The attached patch updates
that comment to describe the current behavior.

Regards,
Ayush

Attachments:

0001-postmaster-fix-stale-pm_startup-comment.patchapplication/octet-stream; name=0001-postmaster-fix-stale-pm_startup-comment.patchDownload+3-3
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ayush Tiwari (#1)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

On 15/04/2026 16:57, Ayush Tiwari wrote:

Hi,

The comment above the PM_STARTUP startup-process-failure case still says
that there are no other processes running yet, so the postmaster can just
exit.

That no longer matches the current startup flow: PM_STARTUP may already
have auxiliary processes running by that point. The attached patch updates
that comment to describe the current behavior.

Hmm, shouldn't the postmaster kill and wait for the auxiliary processes
to exit first in that case? ISTM we need code changes here, not just
comments.

- Heikki

#3Ayush Tiwari
ayushtiwari.slg01@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

Hi

On Wed, 15 Apr 2026 at 22:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 15/04/2026 16:57, Ayush Tiwari wrote:

Hi,

The comment above the PM_STARTUP startup-process-failure case still says
that there are no other processes running yet, so the postmaster can just
exit.

That no longer matches the current startup flow: PM_STARTUP may already
have auxiliary processes running by that point. The attached patch

updates

that comment to describe the current behavior.

Hmm, shouldn't the postmaster kill and wait for the auxiliary processes
to exit first in that case? ISTM we need code changes here, not just
comments.

- Heikki

Yes, I agree, code change is required here.

The proper thing is to
route this through the existing crash-handling path so the postmaster
SIGQUITs the aux children and waits for them to exit before terminating.

I think the minimal change is:

1. Replace the ExitPostmaster(1) shortcut in the PM_STARTUP
startup-failure case with HandleChildCrash(), which calls
TerminateChildren(SIGQUIT) and transitions through the state
machine. Set StartupStatus = STARTUP_CRASHED so the state
machine does not try to reinitialize.

2. Let HandleFatalError() handle PM_STARTUP by transitioning to
PM_WAIT_BACKENDS, instead of the current Assert(false).

The state machine already handles STARTUP_CRASHED at PM_NO_CHILDREN
("shutting down due to startup process failure"), so the exit path is
already correct once all children have drained.

This issue was discussed in an older thread by Noah too, so, adding him in
cc.

I can send in a proper patch if you think this is the right way to go.

Regards,
Ayush

#4Ayush Tiwari
ayushtiwari.slg01@gmail.com
In reply to: Ayush Tiwari (#3)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

Hi,

On Wed, 15 Apr 2026 at 23:31, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
wrote:

Hi

On Wed, 15 Apr 2026 at 22:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 15/04/2026 16:57, Ayush Tiwari wrote:

Hi,

The comment above the PM_STARTUP startup-process-failure case still says
that there are no other processes running yet, so the postmaster can

just

exit.

That no longer matches the current startup flow: PM_STARTUP may already
have auxiliary processes running by that point. The attached patch

updates

that comment to describe the current behavior.

Hmm, shouldn't the postmaster kill and wait for the auxiliary processes
to exit first in that case? ISTM we need code changes here, not just
comments.

- Heikki

Yes, I agree, code change is required here.

The proper thing is to
route this through the existing crash-handling path so the postmaster
SIGQUITs the aux children and waits for them to exit before terminating.

I think the minimal change is:

1. Replace the ExitPostmaster(1) shortcut in the PM_STARTUP
startup-failure case with HandleChildCrash(), which calls
TerminateChildren(SIGQUIT) and transitions through the state
machine. Set StartupStatus = STARTUP_CRASHED so the state
machine does not try to reinitialize.

2. Let HandleFatalError() handle PM_STARTUP by transitioning to
PM_WAIT_BACKENDS, instead of the current Assert(false).

The minimal fix turned out to be smaller than I first described, the
existing paragraph immediately below the ExitPostmaster(1) block already
handles !EXIT_STATUS_0 with StartupStatus != STARTUP_SIGNALED correctly
(sets STARTUP_CRASHED and HandleChildCrash). So, likely fix would be:

1. Deleting the PM_STARTUP ExitPostmaster(1) shortcut, and letting
execution fall through to the next stanza.

2. Replacing the Assert(false) for PM_STARTUP in HandleFatalError() with a
fall-through to UpdatePMState(PM_WAIT_BACKENDS).

Verification that I did for patch:

On a fresh initdb'd cluster, I zeroed out the first WAL segment to force
the startup process to FATAL at StartupXLOG, then ran PG in foreground
under strace.

Before (master):
LOG: startup process (PID N) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

strace of the postmaster PID shows 0 kill() calls to children before
exit_group(1). Checkpointer, bgwriter and io workers were running at
the time of the failure and were orphaned.

After (patched):
LOG: startup process (PID N) exited with exit code 1
LOG: terminating any other active server processes
<state transitions PM_STARTUP -> PM_WAIT_BACKENDS -> PM_WAIT_DEAD_END
-> PM_NO_CHILDREN>
LOG: shutting down due to startup process failure
LOG: database system is shut down

strace shows 8 SIGQUIT deliveries (4 children, each signaled by PID
and by process-group) before the postmaster's own exit_group(1).

I've attached a patch, please review and let me know your thoughts.

Regards,
Ayush

Attachments:

0001-postmaster-drain-aux-processes-on-startup-process-fa.patchapplication/octet-stream; name=0001-postmaster-drain-aux-processes-on-startup-process-fa.patchDownload+5-25
#5Michael Paquier
michael@paquier.xyz
In reply to: Ayush Tiwari (#4)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

On Fri, Apr 17, 2026 at 07:17:18PM +0530, Ayush Tiwari wrote:

I've attached a patch, please review and let me know your thoughts.

/*
- * Unexpected exit of startup process (including FATAL exit)
- * during PM_STARTUP is treated as catastrophic. There are no
- * other processes running yet, so we can just exit.
- */
- if (pmState == PM_STARTUP &&
- StartupStatus != STARTUP_SIGNALED &&
- !EXIT_STATUS_0(exitstatus))

The assumption that only the startup process is running while we are
in a PM_STARTUP state is wrong since 7ff23c6d277d, and this exit code
path has been missed the fact that now the checkpointer and the
bgwriter are started during recovery. So this means a backpatch.

Removing the assertion is the right move, yes, there are other
children, so again that's an issue with 7ff23c6d277d. I am planning
to double-check the shutdown sequence while switching to
PM_WAIT_BACKENDS under a PM_STARTUP, just note that bgworkers that use
BgWorkerStart_PostmasterStart cannot request a database connection,
meaning that PM_WAIT_BACKENDS should be pointeless, but perhaps it
makes the whole shutdown flow easier to reason about, as you are
suggesting, as making this path more complicated would lead to the
addition of more postmaster states. Making this code simpler, not
more complicated, is always useful.
--
Michael

#6Ayush Tiwari
ayushtiwari.slg01@gmail.com
In reply to: Michael Paquier (#5)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

Hi,

Thanks a lot for the review!.

On Sat, 18 Apr 2026 at 04:24, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 17, 2026 at 07:17:18PM +0530, Ayush Tiwari wrote:

I've attached a patch, please review and let me know your thoughts.

/*
- * Unexpected exit of startup process (including FATAL exit)
- * during PM_STARTUP is treated as catastrophic. There are no
- * other processes running yet, so we can just exit.
- */
- if (pmState == PM_STARTUP &&
- StartupStatus != STARTUP_SIGNALED &&
- !EXIT_STATUS_0(exitstatus))

The assumption that only the startup process is running while we are
in a PM_STARTUP state is wrong since 7ff23c6d277d, and this exit code
path has been missed the fact that now the checkpointer and the
bgwriter are started during recovery. So this means a backpatch.

Agreed, this is latent since 7ff23c6d277d (v15). I can prepare a
back-branch
versions patch for v15..master once the master patch shape is settled

Removing the assertion is the right move, yes, there are other
children, so again that's an issue with 7ff23c6d277d. I am planning
to double-check the shutdown sequence while switching to
PM_WAIT_BACKENDS under a PM_STARTUP, just note that bgworkers that use
BgWorkerStart_PostmasterStart cannot request a database connection,
meaning that PM_WAIT_BACKENDS should be pointeless, but perhaps it
makes the whole shutdown flow easier to reason about, as you are
suggesting, as making this path more complicated would lead to the
addition of more postmaster states. Making this code simpler, not
more complicated, is always useful.
--
Michael

Right, I believe at PM_STARTUP the only children we can possibly have are
the auxiliary processes (checkpointer, bgwriter, walwriter (if applicable),
IO workers) and BgWorkerStart_PostmasterStart bgworkers,
none of which should hold backend slots. So, PM_WAIT_BACKENDS is
functionally
a no-op in this case and we transition straight through to
PM_WAIT_DEAD_END.

I considered short-circuiting directly to
PM_WAIT_DEAD_END from PM_STARTUP, but routing through the same
state path the rest of the crash-handling code uses keeps the state machine
uniform and avoids a special case in HandleFatalError() /
PostmasterStateMachine().
Making the code simpler, as you mentioned.

Regards,
Ayush

#7Michael Paquier
michael@paquier.xyz
In reply to: Ayush Tiwari (#6)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

On Mon, Apr 20, 2026 at 12:11:45PM +0530, Ayush Tiwari wrote:

Agreed, this is latent since 7ff23c6d277d (v15). I can prepare a
back-branch versions patch for v15..master once the master patch
shape is settled.

No need to. I have poked at this problem a bit more, stracing my way
as you did, and after more testing across v15~HEAD, I have applied it.
For v15, a difference becomes necessary at HandleChildCrash(), or we
would begin to fail the shutdown sequence should the startup process
have the idea to PANIC. This maps with the changes in v18 and HEAD
where this has been replaced by a switch/case.

Another thing that I have spent a long time looking at is
process_pm_child_exit() and the interference that this could generate
for the startup process case, but here as well I did not spot any
issue, so I think that we are in the clear.

There was also a comment at the top of postmaster.c that incorrectly
claimed that the checkpointer and the background writer were only
started after switch to PM_RECOVERY, which was wrong. I have tweaked
that while on it.

That was a good catch overall.
--
Michael

#8Ayush Tiwari
ayushtiwari.slg01@gmail.com
In reply to: Michael Paquier (#7)
Re: [PATCH] postmaster: fix stale PM_STARTUP comment

On Tue, 21 Apr 2026 at 06:30, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 20, 2026 at 12:11:45PM +0530, Ayush Tiwari wrote:

No need to. I have poked at this problem a bit more, stracing my way
as you did, and after more testing across v15~HEAD, I have applied it.
For v15, a difference becomes necessary at HandleChildCrash(), or we
would begin to fail the shutdown sequence should the startup process
have the idea to PANIC. This maps with the changes in v18 and HEAD
where this has been replaced by a switch/case.

Another thing that I have spent a long time looking at is
process_pm_child_exit() and the interference that this could generate
for the startup process case, but here as well I did not spot any
issue, so I think that we are in the clear.

There was also a comment at the top of postmaster.c that incorrectly
claimed that the checkpointer and the background writer were only
started after switch to PM_RECOVERY, which was wrong. I have tweaked
that while on it.

That was a good catch overall.

Thanks a lot Michael for looking into it and pushing it!

Regards,
Ayush