Reorder shutdown sequence, to flush pgstats later
Hi,
In the AIO patchset I just hit a corner case in which IO workers can emit
stats ([1]). That in turn can trigger an assertion failure, because by the
time the IO workers are shut down, checkpointer already has exited and written
out the stats. In this case the relevant IO has completed, but nothing
triggered a stats flush in the IO workers before checkpointer exited.
I think Bilal has also hit this problem somewhat recently, as part of his work
to track WAL IO in pg_stat_io. We intentionally allow walsenders to be active
after checkpointer writes out the shutdown checkpoint and exits.
Robert suggested, when chatting with him about this problem, that we could use
a global barrier to trigger pending stats to be flushed before checkpointer
exits. While that, I think, would fix "my" issue with pending stats in IO
workers, I don't think it'd fix Bilal's problem.
I think that to properly fix this we need a place to flush stats after
everything has shut down.
One way we could do this would be to introduce a new aux process that's
started at the end of the shutdown sequence. But that seems a bit too big a
hammer, without corresponding benefit.
I instead opted to somewhat rearrange the shutdown sequence:
This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.
Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.
In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, which seems a lot more descriptive.
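To make the reordering concrete, here is a hedged, self-contained sketch of the tail of the resulting shutdown sequence (a simplified model using this patch's v1 state names, not the actual postmaster code; the real transitions are driven by PostmasterStateMachine()):

```c
#include <assert.h>

/* Simplified model of the tail of the new shutdown sequence. */
typedef enum PMState
{
	PM_SHUTDOWN,				/* checkpointer asked to write shutdown ckpt */
	PM_XLOG_IS_SHUTDOWN,		/* ckpt written; archiver/walsenders drain */
	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
	PM_SHUTDOWN_CHECKPOINTER,	/* new: only now terminate checkpointer */
	PM_NO_CHILDREN				/* all important children have exited */
} PMState;

/* next shutdown state, assuming the condition for leaving 'cur' is met */
static PMState
next_state(PMState cur)
{
	switch (cur)
	{
		case PM_SHUTDOWN:			return PM_XLOG_IS_SHUTDOWN;
		case PM_XLOG_IS_SHUTDOWN:	return PM_WAIT_DEAD_END;
		case PM_WAIT_DEAD_END:		return PM_SHUTDOWN_CHECKPOINTER;
		default:					return PM_NO_CHILDREN;
	}
}
```

The point of the reordering shows up in the model: checkpointer is only terminated in PM_SHUTDOWN_CHECKPOINTER, after every process that could emit stats is gone.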
This patch is not fully polished, there are a few details I'm not sure about:
- Previously the shutdown checkpoint and process exit were triggered from
HandleCheckpointerInterrupts(). I could have continued doing so in the
patch, but it seemed quite weird to have a wait loop in a function called
HandleCheckpointerInterrupts().
Instead I made the main loop in CheckpointerMain() terminate if checkpointer
was asked to write the shutdown checkpoint or exit
- In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
checkpointer signalled ShutdownXLOG having completed. We didn't really have
a need for that before. process_pm_child_exit() does call
PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
feels slightly off
- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?
- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?
The first two patches are just logging improvements that I found helpful to
write this patch:
0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.
Requires a helper to stringify PMState.
I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.
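As a rough illustration of what 0001 adds (hypothetical, trimmed-down names and state list; the real helper covers every PMState value and logs via elog(DEBUG2, ...)):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* cut-down stand-ins for postmaster.c's state and helpers */
typedef enum PMState { PM_RUN, PM_STOP_BACKENDS, PM_WAIT_BACKENDS } PMState;

static PMState pmState = PM_RUN;

/* stringify a PMState value via a case-label macro */
static const char *
pmstate_name(PMState state)
{
#define PM_TOSTR_CASE(sym) case sym: return #sym
	switch (state)
	{
		PM_TOSTR_CASE(PM_RUN);
		PM_TOSTR_CASE(PM_STOP_BACKENDS);
		PM_TOSTR_CASE(PM_WAIT_BACKENDS);
	}
#undef PM_TOSTR_CASE
	return "(unknown)";
}

/* central place to change pmState, logging every transition */
static void
UpdatePMState(PMState newState)
{
	/* the real patch uses elog(DEBUG2, ...) instead of stderr */
	fprintf(stderr, "updating PMState from %s to %s\n",
			pmstate_name(pmState), pmstate_name(newState));
	pmState = newState;
}
```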
0002: Make logging of postmaster signalling child processes more consistent
I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().
The next is a small prerequisite:
0003: postmaster: Introduce variadic btmask_all_except()
I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.
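A minimal sketch of the variadic idea, with stand-in types (the in-tree BackendTypeMask and helpers differ; here the argument count is passed explicitly rather than derived by a macro):

```c
#include <assert.h>
#include <stdarg.h>

/* toy stand-in for postmaster.c's BackendTypeMask */
typedef struct BackendTypeMask { unsigned mask; } BackendTypeMask;

#define BTYPE_MASK_ALL ((BackendTypeMask) {0xFFFFFFFFu})

static BackendTypeMask
btmask_del(BackendTypeMask mask, int t)
{
	mask.mask &= ~(1u << t);
	return mask;
}

/*
 * Variadic replacement for a family of btmask_all_except2(),
 * btmask_all_except3(), ... helpers: start from the full mask and
 * remove each backend type passed after nargs.
 */
static BackendTypeMask
btmask_all_except_n(int nargs, ...)
{
	BackendTypeMask result = BTYPE_MASK_ALL;
	va_list		ap;

	va_start(ap, nargs);
	for (int i = 0; i < nargs; i++)
		result = btmask_del(result, va_arg(ap, int));
	va_end(ap);
	return result;
}
```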
And then 0004, the reason for this thread.
Comments? Better ideas?
Greetings,
Andres Freund
[1]: A buffered write completes and RememberSyncRequest() fails, leading to the
IO workers performing the flush themselves.
Attachments:
v1-0001-postmaster-Update-pmState-in-new-function.patch (+57 -22)
v1-0002-postmaster-Improve-logging-of-signals-sent-by-pos.patch (+51 -11)
v1-0003-postmaster-Introduce-variadic-btmask_all_except.patch (+9 -13)
v1-0004-WIP-Change-shutdown-sequence-to-terminate-checkpo.patch (+170 -77)
On 08/01/2025 04:10, Andres Freund wrote:
I instead opted to somewhat rearrange the shutdown sequence:
This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.
Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.
In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.
Sounds good.
This patch is not fully polished, there are a few details I'm not sure about:
- Previously the shutdown checkpoint and process exit were triggered from
HandleCheckpointerInterrupts(). I could have continued doing so in the
patch, but it seemed quite weird to have a wait loop in a function called
HandleCheckpointerInterrupts().
Instead I made the main loop in CheckpointerMain() terminate if checkpointer
was asked to write the shutdown checkpoint or exit
- In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
checkpointer signalled ShutdownXLOG having completed. We didn't really have
a need for that before. process_pm_child_exit() does call
PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
feels slightly off
Looks good to me.
- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?
We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow
that example and name all the shutdown states after what you're waiting for:
PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to finish */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_NO_CHILDREN, /* all important children have exited */
- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?
What logging do you mean? The "LOG: shutting down" message? No
objections to moving it, although it doesn't bother me either way.
The first two patches are just logging improvements that I found helpful to
write this patch:
0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.
Requires a helper to stringify PMState.
I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.
+1
elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));
I think the state names are enough, no need to print the numerical values.
0002: Make logging of postmaster signalling child processes more consistent
I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().
+1
I don't think the cases where DEBUG2 or DEBUG4 are used currently are
very well thought out. To save a few lines of code, I'd be happy with
merging signal_child_ext() and signal_child() and just always use DEBUG2
or DEBUG4. Or DEBUG3 as a compromise :-).
The next is a small prerequisite:
0003: postmaster: Introduce variadic btmask_all_except()
I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.
Nice. A variadic btmask_add() might be handy too.
And then 0004, the reason for this thread.
Overall, looks good to me.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:
On 08/01/2025 04:10, Andres Freund wrote:
elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));
I think the state names are enough, no need to print the numerical values.
+1
0002: Make logging of postmaster signalling child processes more consistent
I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().
+1
Random comments:
=== 1
+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+ switch (signal)
What about?
"
#define PM_SIGNAL_CASE(sig) case sig: return #sig
"
to better reflect its context.
=== 2
+ default:
+ /* all signals sent by postmaster should be listed here */
+ Assert(false);
+ return "(unknown)";
+ }
+#undef PM_CASE
+ pg_unreachable();
Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).
I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
well thought out. To save a few lines of code, I'd be happy with merging
signal_child_ext() and signal_child()
+1
The next is a small prerequisite:
0003: postmaster: Introduce variadic btmask_all_except()
I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.
LGTM.
Nice. A variadic btmask_add() might be handy too.
And then 0004, the reason for this thread.
Did not give it a detailed review, but IIUC it now provides this flow:
PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN
and that looks good to me to fix the issue.
A few random comments:
=== 3
+ postmaster.c will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.
s/postmaster.c/PostmasterStateMachine()/?
=== 4
+ * too. That allows checkpointer to perform some last bits of of
Typo "of of"
I'll give 0004 a closer look once it leaves the WIP state but want to share that
it looks like a good way to fix the issue.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2025-01-08 15:10:03 +0200, Heikki Linnakangas wrote:
On 08/01/2025 04:10, Andres Freund wrote:
I instead opted to somewhat rearrange the shutdown sequence:
- I don't love the naming of the various PMState values (existing and new),
but a larger renaming should probably be done separately?
We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow that
example and name all the shutdown states after what you're waiting for:
PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to finish */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_NO_CHILDREN, /* all important children have exited */
I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.
It's probably best to break it out into a separate commit, see attached.
- There's logging in ShutdownXLOG() that's only triggered if called from
checkpointer. Seems kinda odd - shouldn't we just move that to
checkpointer.c?
What logging do you mean? The "LOG: shutting down" message? No objections to
moving it, although it doesn't bother me either way.
Yea. It's also an oddly unspecific message :).
The first two patches are just logging improvements that I found helpful to
write this patch:
0001: Instead of modifying pmState directly, do so via a new function
UpdatePMState(), which logs the state transition at DEBUG2.
Requires a helper to stringify PMState.
I've written one-off versions of this patch quite a few times. Without
knowing in what state postmaster is in, it's quite hard to debug the
state machine.
+1
elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
pmState, pmstate_name(pmState), newState, pmstate_name(newState));
I think the state names are enough, no need to print the numerical values.
I thought the numeric value was mildly helpful as the numerically higher value
makes it slightly easier to understand the "direction" of state changes. But I
also wasn't sure it's worth it. Removed in the attached.
0002: Make logging of postmaster signalling child processes more consistent
I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().
+1
I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
well thought out. To save a few lines of code, I'd be happy with merging
signal_child_ext() and signal_child() and just always use DEBUG2 or DEBUG4.
Or DEBUG3 as a compromise :-).
Heh. DEBUG2 seems a bit much compared to some other stuff, e.g. a config
reload will be logged at DEBUG2, seems a bit superfluous to then also use
DEBUG2 for all individual signals.
I was a bit hesitant to lower the level in sigquit_child(), but thinking about
it, the fact that we're gonna signal children is already obvious from the log.
So I went for your compromise position :)
The next is a small prerequisite:
0003: postmaster: Introduce variadic btmask_all_except()
I proposed this in another thread, where I also needed
btmask_all_except3() but then removed the only call to
btmask_all_except2(). Instead of adding/removing code, it seems better
to just use a variadic function.
Nice. A variadic btmask_add() might be handy too.
It does seem to make the code a bit more readable. See attached. Wasn't quite
sure in how many btmask_add() invocations to break it up...
And then 0004, the reason for this thread.
One problem I found is this comment:
* Note: we deliberately ignore SIGTERM, because during a standard Unix
* system shutdown cycle, init will SIGTERM all processes at once. We
* want to wait for the backends to exit, whereupon the postmaster will
* tell us it's okay to shut down (via SIGUSR2).
Which would cause problems with v1 of the patch, as postmaster would write a
shutdown checkpoint in response to SIGTERM.
Unfortunately we don't have any obvious unused signals available, as SIGINT is
used to request a checkpoint.
To deal with that I added a new patch to the stack, which makes
RequestCheckpoint() use SetLatch() on checkpointer's latch to wake it up,
instead of kill(SIGINT). Not that that matters, but using SetLatch()
directly, instead of doing SetLatch() in the signal handler, is slightly more
efficient, because it avoids a signal interrupt when checkpointer isn't
waiting on a latch.
The relevant comment mentioned "or is in process of restarting" - but that's
afaict bogus, we don't allow checkpointer to restart.
After that, postmaster can use SIGINT to request checkpointer to write the
shutdown checkpoint.
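The gist of the SetLatch() argument can be sketched with a toy latch (hypothetical names, single-process only; the real latch also wakes a sleeping waiter via a self-pipe/signalfd, which is what makes it a drop-in replacement for the signal):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy latch, only to illustrate the efficiency point: if the target isn't
 * currently waiting, setting the latch just flips a flag, and the next
 * wait returns immediately - no signal delivery/interrupt needed.
 */
typedef struct ToyLatch
{
	volatile bool is_set;
} ToyLatch;

static void
toy_set_latch(ToyLatch *latch)
{
	latch->is_set = true;
}

/*
 * Stands in for WaitLatch() + ResetLatch(): returns true without blocking
 * if the latch was already set, consuming the set state.
 */
static bool
toy_wait_latch_nonblocking(ToyLatch *latch)
{
	if (latch->is_set)
	{
		latch->is_set = false;
		return true;
	}
	return false;
}
```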
I'm currently planning to push the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.
The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.
Greetings,
Andres Freund
Attachments:
v2-0001-postmaster-Update-pmState-in-new-function.patch (+57 -22)
v2-0002-postmaster-Improve-logging-of-signals-sent-by-pos.patch (+42 -14)
v2-0003-postmaster-Introduce-variadic-btmask_all_except.patch (+9 -13)
v2-0004-postmaster-Make-btmask_add-variadic.patch (+34 -21)
v2-0005-postmaster-Rename-some-shutdown-related-PMState-p.patch (+27 -26)
v2-0006-checkpointer-Request-checkpoint-via-latch-instead.patch (+19 -42)
v2-0007-Change-shutdown-sequence-to-terminate-checkpointe.patch (+180 -75)
Hi,
On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:
On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:
On 08/01/2025 04:10, Andres Freund wrote:
0002: Make logging of postmaster signalling child processes more consistent
I found it somewhat hard to understand what's happening during state
changes without being able to see the signals being sent. While we did
have logging in SignalChildren(), we didn't in signal_child(), and most
signals that are important for the shutdown sequence are sent via
signal_child().
+1
Random comments:
=== 1
+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+ switch (signal)
What about?
"
#define PM_SIGNAL_CASE(sig) case sig: return #sig
"
to better reflect its context.
I went for PM_TOSTR_CASE - that way the same name can be used in
pmstate_name().
=== 2
+ default:
+ /* all signals sent by postmaster should be listed here */
+ Assert(false);
+ return "(unknown)";
+ }
+#undef PM_CASE
+ pg_unreachable();
Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).
I don't think so - we're not listing all signals here, just ones that
postmaster is currently using. They're also typically not defined in an enum
allowing the compiler to warn about uncovered values. It seemed better to
actually return something in a production build rather than aborting.
Some compilers tend to be pretty annoying about missing return values, that's
why I added the pg_unreachable(). Perhaps I should have done a
return "" /* silence compilers */;
or such.
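For illustration, a self-contained version of the pattern being discussed (abbreviated signal list; the in-tree version additionally has the Assert(false) and the pg_unreachable() after the switch, which this standalone sketch drops so it can run in production-style builds):

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/*
 * Stringify a signal number. Signals are plain ints, not an enum, so the
 * compiler cannot warn about uncovered values - hence the defensive
 * default that returns something useful rather than aborting.
 */
static const char *
pm_signame(int signal)
{
#define PM_TOSTR_CASE(sig) case sig: return #sig
	switch (signal)
	{
		PM_TOSTR_CASE(SIGINT);
		PM_TOSTR_CASE(SIGTERM);
		PM_TOSTR_CASE(SIGQUIT);
		PM_TOSTR_CASE(SIGUSR2);
		default:
			/* all signals sent by postmaster should be listed here */
			return "(unknown)";
	}
#undef PM_TOSTR_CASE
}
```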
Nice. A variadic btmask_add() might be handy too.
And then 0004, the reason for this thread.
Did not give it a detailled review, but IIUC it now provides this flow:
PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN
and that looks good to me to fix the issue.
A few random comments:
=== 3
+ postmaster.c will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.
s/postmaster.c/PostmasterStateMachine()/?
I just went for 'postmaster' (without the .c) - I don't think it's really
useful to specifically reference s/postmaster.c/PostmasterStateMachine() in
checkpointer.c. But I could be swayed if you feel strongly.
=== 4
+ * too. That allows checkpointer to perform some last bits of of
Typo "of of"
Fixed.
I'll give 0004 a closer look once it leaves the WIP state but want to share that
it looks like a good way to fix the issue.
Cool.
Thanks for the review,
Andres Freund
Hi,
On Wed, Jan 08, 2025 at 02:32:24PM -0500, Andres Freund wrote:
On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:
=== 2
+ default:
+ /* all signals sent by postmaster should be listed here */
+ Assert(false);
+ return "(unknown)";
+ }
+#undef PM_CASE
+ pg_unreachable();
Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).
I don't think so - we're not listing all signals here, just ones that
postmaster is currently using. They're also typically not defined in an enum
allowing the compiler to warn about uncovered values.
Oh right, the parameter is an "int" not an "enum" (and anyway, as you said, we're
not listing all the signals) (did not pay attention to that). So we may need to
"help" some compilers regarding missing return values.
It seemed better to
actually return something in a production build rather than aborting.
Yeah, fully agree.
Perhaps I should have done a
return "" /* silence compilers */;
Using pg_unreachable() is fine but a comment (like the return value you're
suggesting above) could help.
Thanks for the explanation!
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Wed, Jan 08, 2025 at 02:26:15PM -0500, Andres Freund wrote:
I'm currently planning to push the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.
Thanks for the patches.
A few comments:
0001 LGTM.
0002,
=== 1
+static const char *
+pm_signame(int signal)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+ switch (signal)
s/state/signal/? (seems better in the pm_signame() context)
0003 and 0004 LGTM.
0005,
=== 2
+ PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
I don't love PM_WAIT_XLOG_ARCHIVAL, but I can't think of anything better.
PM_WAIT_ARCHIVER_WALSENDERS maybe? (that would follow the pattern of naming
the processes like PM_WAIT_BACKENDS, PM_WAIT_CHECKPOINTER for example).
That said, I'm not 100% convinced it makes it more clear though...
The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.
0006,
=== 3
+ * when it does start, with or without getting a signal.
s/getting a signal/getting a latch set/ or "getting notified"?
=== 4
+ if (checkpointerProc == INVALID_PROC_NUMBER)
{
if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
{
elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: checkpointer is not running");
+ "could not notify checkpoint: checkpointer is not running");
Worth renaming MAX_SIGNAL_TRIES with MAX_NOTIFY_TRIES then?
0007,
=== 5
+ pqsignal(SIGINT, ReqShutdownXLOG);
Worth a comment like:
pqsignal(SIGINT, ReqShutdownXLOG); /* trigger shutdown checkpoint */
=== 6
+ * Wait until we're asked to shut down. By seperating the writing of the
Typo: s/seperating/separating/
I don't really have anything else on 0006 and 0007, but as you said it's probably
worth a few more eyes as the code change is pretty substantial.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
Thanks for working on this!
On Wed, 8 Jan 2025 at 22:26, Andres Freund <andres@anarazel.de> wrote:
I'm currently planning to push the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial. The renaming of the phases
doesn't need to wait much longer, I think.
The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.
Some minor comments:
=== 0001
LGTM.
=== 0002
+ }
+#undef PM_TOSTR_CASE
+ pg_unreachable();
+}
Maybe a blank line after '#undef PM_TOSTR_CASE' (or no blank line at
the end of the pmstate_name() at 0001)?
+ ereport(DEBUG3,
+ (errmsg_internal("sending signal %d/%s to %s process %d",
I am not sure if that makes sense but with the addition of the backend
type here, I think 'sending signal %d/%s to %s process with the pid of
%d' would be clearer.
=== 0003
LGTM.
=== 0004
LGTM.
=== 0005
I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.
I liked this, I think it is good enough.
- PM_SHUTDOWN, /* waiting for checkpointer to do shutdown
+ PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown
* ckpt */
There are a couple of variables and functions which include pm_shutdown
in their names:
pending_pm_shutdown_request
handle_pm_shutdown_request_signal()
process_pm_shutdown_request()
Do you think these need to be updated as well?
=== 0006
Please see below.
=== 0007
- * told to shut down. We expect that it wrote a shutdown
- * checkpoint. (If for some reason it didn't, recovery will
- * occur on next postmaster start.)
+ * told to shut down. We know it wrote a shutdown checkpoint,
+ * otherwise we'd still be in PM_SHUTDOWN.
s/PM_SHUTDOWN/PM_WAIT_XLOG_SHUTDOWN/
I have these comments for now. I need to study 0006 and 0007 more.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On 2025-01-09 16:50:45 +0300, Nazir Bilal Yavuz wrote:
On Wed, 8 Jan 2025 at 22:26, Andres Freund <andres@anarazel.de> wrote:
=== 0005
I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.
I liked this, I think it is good enough.
- PM_SHUTDOWN, /* waiting for checkpointer to do shutdown
+ PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown
* ckpt */
There are a couple of variables and functions which include pm_shutdown
in their names:
pending_pm_shutdown_request
handle_pm_shutdown_request_signal()
process_pm_shutdown_request()
Do you think these need to be updated as well?
I don't think so - I think those are actually not specifically referencing the
PM_SHUTDOWN symbol. They're referencing shutting down postmaster - which
neither starts nor ends with PM_SHUTDOWN.
Greetings,
Andres Freund
Hi,
Thanks for the reviews!
On 2025-01-09 12:01:05 +0000, Bertrand Drouvot wrote:
On Wed, Jan 08, 2025 at 02:26:15PM -0500, Andres Freund wrote:
=== 2
+ PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
I don't love PM_WAIT_XLOG_ARCHIVAL, but I can't think of anything better.
PM_WAIT_ARCHIVER_WALSENDERS maybe? (that would follow the pattern of naming
the processes like PM_WAIT_BACKENDS, PM_WAIT_CHECKPOINTER for example).
That said, I'm not 100% convinced it makes it more clear though...
Yea, I just went with PM_WAIT_XLOG_ARCHIVAL, it's ok enough.
The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.
0006,
=== 3
+ * when it does start, with or without getting a signal.
s/getting a signal/getting a latch set/ or "getting notified"?
I went with "with or without the SetLatch()".
=== 4
+ if (checkpointerProc == INVALID_PROC_NUMBER)
{
if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
{
elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: checkpointer is not running");
+ "could not notify checkpoint: checkpointer is not running");
Worth renaming MAX_SIGNAL_TRIES with MAX_NOTIFY_TRIES then?
IDK, didn't seem worth it. SetLatch() is a form of signalling and I don't
think "notify" really is better. And it requires changing more lines...
=== 5
+ pqsignal(SIGINT, ReqShutdownXLOG);
Worth a comment like:
pqsignal(SIGINT, ReqShutdownXLOG); /* trigger shutdown checkpoint */
That seems just a restatement of the function name, don't think we gain
anything by that.
Greetings,
Andres Freund
Hi,
I've pushed 0001-0005.
I think 0006 isn't far from ready.
However, I think 0007 needs a bit more work.
One thing that worries me a bit is that using SIGINT for triggering the
shutdown checkpoint could somehow be unintentionally triggered? Previously
checkpointer would effectively have ignored that, but now it'd bring the whole
cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
I can't really see a legitimate reason for checkpointer to get spurious
SIGINTs.
This made me think about what happens if a spurious SIGINT is sent - and in
the last version the answer wasn't great: everything seemed to go on as
normal, except that the cluster couldn't be shut down normally. There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly". We have
multiple copies of code to go to FatalError, that doesn't seem great.
Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:
/*
* Stop any dead-end children and stop creating new ones.
*/
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
and
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
I don't think this actively causes problems today, but it seems rather iffy.
Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?
I also noticed that checkpointer.c intro comment needed some editing.
Attached are the rebased and edited last two patches.
Independent of this patch, we seem to deal with FatalError processing
somewhat inconsistently. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():
if (Shutdown != ImmediateShutdown)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);
Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS
From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.
b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?
This comment seems to have been added in
commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200
Assign a child slot to every postmaster child process
ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?
We do have one place that directly switches to PM_WAIT_DEAD_END:
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));
During postmaster's first start checkpointer is launched before the startup
process is even started. During crash restart processing, we do start the
startup process before checkpointer, but start checkpointer in
LaunchMissingBackgroundProcesses().
It looks like it's theoretically possible that recovery completes before we ever
reach LaunchMissingBackgroundProcesses() - but it seems like we should address
that by having the same "process startup order" during crash recovery
processing as we have during the first start?
I think this oddity originated in
commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2021-08-02 17:32:20 +1200
Run checkpointer and bgwriter in crash recovery.
ISTM that we should change it so that the first and crash recovery scenarios
are more similar. Thomas, what do you think?
Separately, is it a problem that somewhere in process_pm_* we did
ConfigurePostmasterWaitSet(false) but then the event loop in ServerLoop()
continues to process "old" WL_SOCKET_ACCEPT events?
Greetings,
Andres Freund
Hi,
On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:
However, I think 0007 needs a bit more work.
One thing that worries me a bit is that using SIGINT for triggering the
shutdown checkpoint could somehow be unintentionally triggered? Previously
checkpointer would effectively have ignored that,
Right.
but now it'd bring the whole
cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
I can't really see a legitimate reason for checkpointer to get spurious
SIGINTs.
Yeah, apart from human error on the OS side, I don't see any reason to.
This made me think about what happens if a spurious SIGINT is sent - and in
the last version the answer wasn't great, namely everything seemed to go on
normal, except that the cluster couldn't be shut down normally.
Yeah, and I think that checkpoints would hang.
There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly".
I wonder if we could first ask the postmaster for confirmation that the SIGINT is
coming from it? (and if not, then simply ignore the SIGINT). I thought about a
shared memory flag but the postmaster has to be kept away from shared memory
operations, so that's not a valid idea. Another idea could be that the checkpointer
sends a new "confirm" request to the postmaster first. But then I'm not sure how
the postmaster could reply back (given the signals that are already being used)
and that might overcomplicate things.
So yeah, might be better to be on the safe side of things and "simply" enter FatalError.
We have
multiple copies of code to go to FatalError, that doesn't seem great.
+ * FIXME: This should probably not have a copy of the code to
+ * transition into FatalError state.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));
Indeed, might be worth extracting this into a helper function or such.
Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:
/*
* Stop any dead-end children and stop creating new ones.
*/
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
and
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
I don't think this actively causes problems today,
For the first case, we'd ask the checkpointer to do work in the PM_WAIT_DEAD_END
case while already SIGQUIT'd (through SignalHandlerForCrashExit()). But that's
not an issue per se as we check for CheckpointerPMChild != NULL when
pmState == PM_WAIT_DEAD_END.
For the second case (where we failed to fork a checkpointer), we'd also
check for CheckpointerPMChild != NULL when pmState == PM_WAIT_DEAD_END.
So in both cases we'd ask a non-existing checkpointer to do work but we later
check that it exists, so I also don't think that it causes any problem.
but it seems rather iffy.
Yeah, we intend to ask a non-existing checkpointer process to do work...
Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?
Yeah, given the above, it looks more clean to change that ordering.
Indeed it seems unlikely that we'd introduce dead-end children stats, and even if
we do, we could think back then to put PM_WAIT_CHECKPOINTER after PM_WAIT_DEAD_END (or any
other way to ensure the stats are flushed after they are shut down). So it's
probably better to focus on the current state and then put
PM_WAIT_CHECKPOINTER before PM_WAIT_DEAD_END.
Anyway, if we reach the 2 cases mentioned above we're not going to flush the
stats anyway (CheckpointerPMChild is NULL), so better to reorder, then.
Independent of this patch, we seem to deal with FatalError processing in a
somewhat inconsistent way. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():
if (Shutdown != ImmediateShutdown)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);
Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS
From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.
b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?
This comment seems to have been added in
commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200
Assign a child slot to every postmaster child process
ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?
I agree that does not look correct but I don't think it's coming from a78af0427015449269fb7d9d8c6057cfcb740149.
ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded this already wrong
comment:
"
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
* (ie, no dead_end children remain), and the archiver is gone too.
*
* The reason we wait for those two is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
* shutdown-and-immediately-restart scenario. Note that they have
* already been sent appropriate shutdown signals, either during a
* normal state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
*/
"
while we could also see:
"
if (Shutdown != ImmediateShutdown)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
"
It seems to me that it was introduced in e6a442c71b30f62e7b5eee6058afc961b1c7f29b,
where the above comment was added and did not align with what HandleChildCrash()
was already doing (it was already setting PM_WAIT_BACKENDS).
But anyway I do agree that this comment looks wrong.
We do have one place that directly switches to PM_WAIT_DEAD_END:
/*
* If we failed to fork a checkpointer, just shut down.
* Any required cleanup will happen at next restart. We
* set FatalError so that an "abnormal shutdown" message
* gets logged when we exit.
*
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
*/
FatalError = true;
UpdatePMState(PM_WAIT_DEAD_END);
ConfigurePostmasterWaitSet(false);
but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));
I do agree. I was trying to reach this code path while replying to one of
the above comments (about PM_WAIT_DEAD_END ordering) but had to replace the
if condition with another one to reach it during testing.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:
On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:
There wasn't
any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
fatal. I think this needs to be treated the same way we treat not being able
to fork checkpointer during a shutdown. Now receiving
PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
FatalError processing after logging "WAL was shut down unexpectedly".
I wonder if we could first ask the postmaster for confirmation that the SIGINT is
coming from it? (and if not, then simply ignore the SIGINT). I thought about a
shared memory flag but the postmaster has to be kept away from shared memory
operations, so that's not a valid idea. Another idea could be that the checkpointer
sends a new "confirm" request to the postmaster first. But then I'm not sure how
the postmaster could reply back (given the signals that are already being used)
and that might overcomplicate things.
That sounds more complicated than it's worth it for a hypothetical risk.
Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
thought it'd be better to shut checkpointer down after even dead-end children
exited, in case we ever wanted to introduce stats for dead-end children - but
that seems rather unlikely?
Yeah, given the above, it looks more clean to change that ordering.
I'll post a version that does that in a bit.
Independent of this patch, we seem to deal with FatalError processing in a
somewhat inconsistently. We have this comment (in master):
/*
* PM_WAIT_DEAD_END state ends when all other children are gone except
* for the logger. During normal shutdown, all that remains are
* dead-end backends, but in FatalError processing we jump straight
* here with more processes remaining. Note that they have already
* been sent appropriate shutdown signals, either during a normal
* state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
However that doesn't actually appear to be true in the most common way to
reach FatalError, via HandleChildCrash():
if (Shutdown != ImmediateShutdown)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_WAIT_XLOG_SHUTDOWN)
UpdatePMState(PM_WAIT_BACKENDS);
Here we
a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS
From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
other than walsender, archiver and dead-end children exited.
b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
PM_WAIT_BACKENDS?
This comment seems to have been added in
commit a78af0427015449269fb7d9d8c6057cfcb740149
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-11-14 16:12:28 +0200
Assign a child slot to every postmaster child process
ISTM that section of the comment is largely wrong and has never really been
the case if my git sleuthing is correct?
I agree that does not look correct but I don't think it's coming from a78af0427015449269fb7d9d8c6057cfcb740149.
ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded this already wrong
comment:
"
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
* (ie, no dead_end children remain), and the archiver is gone too.
*
* The reason we wait for those two is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
* shutdown-and-immediately-restart scenario. Note that they have
* already been sent appropriate shutdown signals, either during a
* normal state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
*/
"
Hm. This version doesn't really seem wrong in the same way / to the same
degree.
while we could also see:
"
if (Shutdown != ImmediateShutdown)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
"
I don't think this makes the version of the comment you included above
inaccurate? It's saying that the signal has been sent in the state *leading
up* to PM_WAIT_DEAD_END, which does seem accurate? The code the comment is
about just won't be reached until postmaster transitions to PM_WAIT_DEAD_END.
but I suspect that's just about unreachable these days. If checkpointer exits
outside of shutdown processing, it's treated as a reason to crash-restart:
/*
* Any unexpected exit of the checkpointer (including FATAL
* exit) is treated as a crash.
*/
HandleChildCrash(pid, exitstatus,
_("checkpointer process"));
I do agree. I was trying to reach this code path while replying to one of
the above comments (about PM_WAIT_DEAD_END ordering) but had to replace with
another if condition to reach it during testing.
I was able to reach it, unfortunately:
1) Repeated fork failures of checkpointer can lead to it
2) When crash-restarting, we don't start checkpointer immediately, but defer
that until LaunchMissingBackgroundProcesses(). While not easy, it's
possible to trigger a shutdown before that happens.
Greetings,
Andres Freund
Hi,
On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:
We have
multiple copies of code to go to FatalError, that doesn't seem great.
+ * FIXME: This should probably not have a copy of the code to
+ * transition into FatalError state.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));
Indeed, might be worth extracting this into a helper function or such.
That is quite a rat's nest of issues :(. There are quite a few oddities around
the related code. The attached series contains a few commits trying to unify
the paths transitioning to FatalError = true into the new HandleFatalError().
The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
After the earlier commit that just moves the code I turned the existing if ()
into a switch so that whoever adds new states is told to look at that code by
the compiler.
Thoughts?
Another thing I started worrying about is that the commit added
PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
where we directly jump to PM_WAIT_DEAD_END in fatal situations:
Attached patch is implemented that way.
Greetings,
Andres Freund
Attachments:
v4-0001-checkpointer-Request-checkpoint-via-latch-instead.patch (+19 -42)
v4-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patch (+4 -39)
v4-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patch (+7 -13)
v4-0004-postmaster-Move-code-to-switch-into-FatalError-st.patch (+46 -25)
v4-0005-WIP-postmaster-Commonalize-FatalError-paths.patch (+48 -15)
v4-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patch (+18 -11)
v4-0007-Change-shutdown-sequence-to-terminate-checkpointe.patch (+200 -73)
Hi,
On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:
Hi,
On 2025-01-13 12:20:39 +0000, Bertrand Drouvot wrote:
We have
multiple copies of code to go to FatalError, that doesn't seem great.
+ * FIXME: This should probably not have a copy of the code to
+ * transition into FatalError state.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));
Indeed, might be worth extracting this into a helper function or such.
That is quite a rat's nest of issues :(. There are quite a few oddities around
the related code. The attached series contains a few commits trying to unify
the paths transitioning to FatalError = true into the new HandleFatalError().
Thanks!
The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?
After the earlier commit that just moves the code I turned the existing if ()
into a switch so that whoever adds new states is told to look at that code by
the compiler.
Good idea!
Thoughts?
======== Patch 0001:
LGTM
======== Patch 0002:
It makes use of TerminateChildren() in HandleChildCrash(). TerminateChildren()
follows the exact same logic as the code being replaced (exclude syslogger,
StartupPMChild special case), so LGTM.
======== Patch 0003:
It replaces the take_action boolean flag with an early return.
One comment:
=== 1
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;
I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).
======== Patch 0004:
One comment:
=== 2
As per comment === 1 above, then we could remove:
if (Shutdown != ImmediateShutdown)
FatalError = true;
from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior to calling HandleFatalError()).
HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?
======== Patch 0005:
Comments:
=== 3
If so, then in this change:
- FatalError = true;
- UpdatePMState(PM_WAIT_DEAD_END);
- ConfigurePostmasterWaitSet(false);
-
- /* Kill the walsenders and archiver too */
- SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
we should keep the "FatalError = true" part.
=== 4
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
*/
That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
Before doing so, what do you think about:
1. keeping FatalError = true; as suggested above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call
3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?
That's not ideal but that way we would keep the "current" behavior (i.e., failure
to fork checkpointer transitions through PM_WAIT_DEAD_END) until the
"switch directly to PM_DEAD_END" change is implemented?
But that would also need to call TerminateChildren() a second time after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).
======== Patch 0006:
The patch adds archiver and walsender to processes that should exit during
crash/immediate shutdown.
LGTM.
======== Patch 0007:
Comments:
=== 5
commit message says "Change shutdown sequence to terminate checkpointer last",
which is not 100% accurate.
While this one is:
"
Checkpointer now is terminated
after all children other than dead-end ones have been terminated,
"
=== 6 (Nit)
+ * Close down the database.
s/database/database system/?
Was already there before the patch, just suggesting in passing...
=== 7
+ * PM_WAIT_XLOG_ARCHIVAL is in proccess_pm_pmsignal(), in response to
typo: s/proccess_pm_pmsignal/process_pm_pmsignal/
=== 8
+ * Now that everyone important is gone
s/everyone important is/walsenders and archiver are also gone/?
=== 9
+ * in proccess_pm_child_exit().
typo: s/proccess_pm_child_exit/process_pm_child_exit/
=== 10
+ /*
+ * Doesn't seem likely to help to take send_abort_for_crash into
+ * account here.
+ */
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
If comments === 2 and === 3 made sense to you then we should set FatalError
to true here prior the HandleFatalError call.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:
On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:
The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?
There is, although I guess it could be reformulated a bit.
======== Patch 0003:
It replaces the take_action boolean flag with an early return.
One comment:
=== 1
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).======== Patch 0004:
One comment:
=== 2
As per comment === 1 above, then we could remove:
if (Shutdown != ImmediateShutdown)
FatalError = true;
from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior to calling HandleFatalError()).
HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?
I don't see what we'd gain from moving FatalError = true separate from
HandleFatalError? All it'll mean is more copied code?
=== 4
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
*/
That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH would
differ, the rest seems like it'd largely be duplicated:
case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;
That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
Before doing so, what do you think about:
1. keeping FatalError = true; as suggeted above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call
I don't like that, what's the point of having something like HandleFatalError
if we duplicate more code than it saves at each place?
I don't think we need to either, we can just do that in the relevant case
statement in HandleFatalError(). That works, I went back and forth several
times :)
3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?
That one would make sense to me.
But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).
Why would it need to?
=== 8
+ * Now that everyone important is gone
s/everyone important is/walsenders and archiver are also gone/?
I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).
Greetings,
Andres Freund
Hi,
On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:
Hi,
On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:
On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote:
The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
That would probably make more sense, but would be more invasive and would
need careful testing. Worth a XXX comment in the code?
There is, although I guess it could be reformulated a bit.
Okay, thanks!
I think we can avoid this remaining extra check and set FatalError to true
unconditionally (as we already know that Shutdown != ImmediateShutdown as per
the first check in the function).
======== Patch 0004:
One comment:
=== 2
As per comment === 1 above, then we could remove:
if (Shutdown != ImmediateShutdown)
FatalError = true;
from HandleFatalError(). And that would be up to the HandleFatalError() caller
to set FatalError to true (prior to calling HandleFatalError()).
HandleFatalError becomes a pure "transition to error state" function then.
Does that make sense to you?
I don't see what we'd gain from moving FatalError = true separate from
HandleFatalError? All it'll mean is more copied code?
The "Shutdown != ImmediateShutdown" test made me think that there might be cases
where we don't want to set FatalError to true in HandleFatalError(), hence the
proposal. I can see that that's not the case (which fully makes sense), so I agree
that it's better to set FatalError to true unconditionally in HandleFatalError().
=== 4
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
*/
That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH would
differ, the rest seems like it'd largely be duplicated:
case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;
That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...
Fair enough. I'll look at the remaining pieces once this stuff goes in.
I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.
Before doing so, what do you think about:
1. keeping FatalError = true; as suggested above
2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the HandleFatalError(
PMQUIT_FOR_CRASH, false) call)
I don't like that, what's the point of having something like HandleFatalError
if we duplicate more code than it saves at each place?
I don't think we need to either, we can just do that in the relevant case
statement in HandleFatalError(). That works, I went back and forth several
times :)
Okay ;-)
3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in
HandleFatalError()?
That one would make sense to me.
But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).
Why would it need to?
Because I thought that TerminateChildren() needs to be called after
ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?
=== 8
+ * Now that everyone important is gone
s/everyone important is/walsenders and archiver are also gone/?
I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).
Good point, that's more "difficult" to maintain. Though I'm not sure that
"important" is the right wording. "Now that non dead processes have finished"
maybe? Could be seen as a Nit too.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
Attached is a new version of this patchset:
- HandleFatalError now doesn't expect to be called if already in FatalError or
ImmediateShutdown and contains a comment and assertions to that effect
- Fatal errors when in pmState > PM_WAIT_BACKENDS go to PM_WAIT_DEAD_END
directly. The previous version had them go to PM_WAIT_BACKENDS, to save
duplicating one line. The ConfigurePostmasterWaitSet(false) call is now
duplicated, with a comment at both sites mentioning that.
- Typo, comment and commit message fixes/polishing.
Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.
On 2025-01-15 08:38:33 +0000, Bertrand Drouvot wrote:
On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:
On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:
=== 4
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
+ */

That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
For now I think I'll leave it as such, as we'd need to have a new message in
quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
differ, the rest seems like it'd largely be duplicated:

case PMQUIT_FOR_CRASH:
/* A crash-and-restart cycle is in progress */
ereport(WARNING_CLIENT_ONLY,
(errcode(ERRCODE_CRASH_SHUTDOWN),
errmsg("terminating connection because of crash of another server process"),
errdetail("The postmaster has commanded this server process to roll back"
" the current transaction and exit, because another"
" server process exited abnormally and possibly corrupted"
" shared memory."),
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
break;

That seems like a separate change. Particularly because I'm just working on
this as part of some nasty three layer deep dependency chain - aio is pretty
big on its own...

Fair enough. I'll look at the remaining pieces once this stuff goes in.
Cool. I did try writing the change, but it does indeed seem better as a
separate patch. Besides the error message, it also seems we ought to invent a
different ERRCODE, as neither ERRCODE_ADMIN_SHUTDOWN nor
ERRCODE_CRASH_SHUTDOWN seem appropriate.
Vaguely related and not at all crucial:
quickdie(), in the PMQUIT_FOR_CRASH, does
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...
But that would also need to call TerminateChildren() (a second time) again after
ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
the first call).

Why would it need to?
Because I thought that TerminateChildren() needs to be called after
ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?
I don't think that can happen with the attached patch - the
ConfigurePostmasterWaitSet() is called before HandleFatalError() returns, so
no new connections can be accepted, as that happens only from
ServerLoop()->AcceptConnection().
=== 8
+ * Now that everyone important is gone
s/everyone important is/walsenders and archiver are also gone/?
I'd rather get away from these lists of specific processes - they end up being
noisy in the git history because the lines get updated to add/remove items (or
updates to them get missed).

Good point, that's more "difficult" to maintain. Though I'm not sure that
"important" is the right wording. "Now that non dead processes have finished"
maybe? Could be seen as a Nit too.
This specific case I just solved by referencing "the processes mentioned
above"...
Greetings,
Andres Freund
Attachments:
v5-0001-checkpointer-Request-checkpoint-via-latch-instead.patch (text/x-diff, +19 -42)
v5-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patch (text/x-diff, +11 -39)
v5-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patch (text/x-diff, +8 -15)
v5-0004-postmaster-Move-code-to-switch-into-FatalError-st.patch (text/x-diff, +51 -24)
v5-0005-postmaster-Commonalize-FatalError-paths.patch (text/x-diff, +57 -20)
v5-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patch (text/x-diff, +18 -11)
v5-0007-Change-shutdown-sequence-to-terminate-checkpointe.patch (text/x-diff, +202 -70)
Hi,
On 2025-01-24 15:06:17 -0500, Andres Freund wrote:
Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.
I did push all but the last yesterday and, not seeing any related issues in
the BF, pushed the last one just now. Let's hope the BF stays green(ish)...
Greetings,
Andres
Hi,
On Sat, Jan 25, 2025 at 11:44:54AM -0500, Andres Freund wrote:
Hi,
On 2025-01-24 15:06:17 -0500, Andres Freund wrote:
Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.

I did push all but the last yesterday and, not seeing any related issues in
the BF, pushed the last one just now. Let's hope the BF stays green(ish)...
Just did a quick check and that still looks green(ish) (and when it's not,
it does not seem related to this change).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com