Better client reporting for "immediate stop" shutdowns

Started by Tom Laneover 5 years ago14 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Up to now, if you shut down the database with "pg_ctl stop -m immediate"
then clients get a scary message claiming that something has crashed,
because backends can't tell whether the SIGQUIT they got was sent for
a crash-and-restart situation or because of an immediate-stop command.

This isn't great from a fit-and-finish perspective, and it occurs to me
that it's really easy to do better: the postmaster can stick a flag
into shared memory explaining the reason for SIGQUIT. While we don't
like the postmaster touching shared memory, there doesn't seem to be
any need for interlocking or anything like that, so there is no risk
involved that's greater than those already taken by pmsignal.c.

So, here's a very simple proposed patch. Some issues for possible
bikeshedding:

* Up to now, pmsignal.c has only been for child-to-postmaster
communication, so maybe there is some better place to put the
support code. I can't think of one though.

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

regards, tom lane

Attachments:

report-immediate-stop-more-clearly-1.patchtext/x-diff; charset=us-ascii; name=report-immediate-stop-more-clearly-1.patchDownload+85-16
#2Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#1)
Re: Better client reporting for "immediate stop" shutdowns

On Tue, Dec 22, 2020 at 3:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Up to now, if you shut down the database with "pg_ctl stop -m immediate"
then clients get a scary message claiming that something has crashed,
because backends can't tell whether the SIGQUIT they got was sent for
a crash-and-restart situation or because of an immediate-stop command.

This isn't great from a fit-and-finish perspective, and it occurs to me
that it's really easy to do better: the postmaster can stick a flag
into shared memory explaining the reason for SIGQUIT. While we don't
like the postmaster touching shared memory, there doesn't seem to be
any need for interlocking or anything like that, so there is no risk
involved that's greater than those already taken by pmsignal.c.

+1 to improve the message.

So, here's a very simple proposed patch. Some issues for possible
bikeshedding:

* Up to now, pmsignal.c has only been for child-to-postmaster
communication, so maybe there is some better place to put the
support code. I can't think of one though.

+1 to have it here as we already have the required shared memory
initialization code to add in new flags there -
PMSignalState->sigquit_reason.

If I'm correct, quickdie() doesn't access any shared memory because
one of the reason we can be in quickdie() is when the shared memory
itself is corrupted(the comment down below on why we don't call
roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
have some problem in accessing the shared memory i.e. + return
PMSignalState->sigquit_reason;?

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

AFAIK, errmsg(terminating connection due to administrator command") is
emitted when there's no specific reason. But we know exactly why we
are terminating in this case, how about having an error message like
errmsg("terminating connection due to immediate shutdown request")));
? There are other places where errmsg("terminating connection due to
XXXX reasons"); is used. We also log messages when an immediate
shutdown request is received errmsg("received immediate shutdown
request").

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#3Magnus Hagander
magnus@hagander.net
In reply to: Bharath Rupireddy (#2)
Re: Better client reporting for "immediate stop" shutdowns

On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Dec 22, 2020 at 3:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Up to now, if you shut down the database with "pg_ctl stop -m immediate"
then clients get a scary message claiming that something has crashed,
because backends can't tell whether the SIGQUIT they got was sent for
a crash-and-restart situation or because of an immediate-stop command.

This isn't great from a fit-and-finish perspective, and it occurs to me
that it's really easy to do better: the postmaster can stick a flag
into shared memory explaining the reason for SIGQUIT. While we don't
like the postmaster touching shared memory, there doesn't seem to be
any need for interlocking or anything like that, so there is no risk
involved that's greater than those already taken by pmsignal.c.

+1 to improve the message.

So, here's a very simple proposed patch. Some issues for possible
bikeshedding:

* Up to now, pmsignal.c has only been for child-to-postmaster
communication, so maybe there is some better place to put the
support code. I can't think of one though.

+1 to have it here as we already have the required shared memory
initialization code to add in new flags there -
PMSignalState->sigquit_reason.

If I'm correct, quickdie() doesn't access any shared memory because
one of the reason we can be in quickdie() is when the shared memory
itself is corrupted(the comment down below on why we don't call
roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
have some problem in accessing the shared memory i.e. + return
PMSignalState->sigquit_reason;?

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

AFAIK, errmsg(terminating connection due to administrator command") is
emitted when there's no specific reason. But we know exactly why we
are terminating in this case, how about having an error message like
errmsg("terminating connection due to immediate shutdown request")));
? There are other places where errmsg("terminating connection due to
XXXX reasons"); is used. We also log messages when an immediate
shutdown request is received errmsg("received immediate shutdown
request").

+1. I definitely think having this message be different can be useful.

See also the thread about tracking shutdown reasons (connection
statistics) -- not the same thing, but the same concepts apply.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#3)
Re: Better client reporting for "immediate stop" shutdowns

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

If I'm correct, quickdie() doesn't access any shared memory because
one of the reason we can be in quickdie() is when the shared memory
itself is corrupted(the comment down below on why we don't call
roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
have some problem in accessing the shared memory i.e. + return
PMSignalState->sigquit_reason;?

It couldn't really have any problem in physically accessing the field;
we never detach from the main shared memory block. The risk that needs
to be thought about is that shared memory contains garbage --- for
example, imagine that a failing process scribbled in the wrong part of
shared memory before crashing. So the hazard here is that there's a
small chance that sigquit_reason will contain the wrong value, which
would cause the patch to print a misleading message, or more likely
not print anything (since I didn't put a default case in that switch).
That seems fine to me. Also, because the sequence of events would be
(1) failing process scribbles and crashes, (2) postmaster updates
sigquit_reason, (3) other child processes examine sigquit_reason,
it's fairly likely that we'd get the right answer even if the field
got clobbered during (1).

There might be an argument for emitting the "unexpected SIGQUIT"
text if we find garbage in sigquit_reason. Any thoughts about that?

AFAIK, errmsg(terminating connection due to administrator command") is
emitted when there's no specific reason. But we know exactly why we
are terminating in this case, how about having an error message like
errmsg("terminating connection due to immediate shutdown request")));
? There are other places where errmsg("terminating connection due to
XXXX reasons"); is used. We also log messages when an immediate
shutdown request is received errmsg("received immediate shutdown
request").

+1. I definitely think having this message be different can be useful.

OK, will use "terminating connection due to immediate shutdown
request".

See also the thread about tracking shutdown reasons (connection
statistics) -- not the same thing, but the same concepts apply.

Hm. I wondered for a bit if that patch could make use of this one
to improve its results. For the specific case of SIGQUIT it seems
moot because we aren't going to let backends send any shutdown
statistics during an emergency stop. But maybe the idea could be
extended to let more-accurate termination reasons be provided in
some other cases.

regards, tom lane

#5Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#4)
Re: Better client reporting for "immediate stop" shutdowns

On Tue, Dec 22, 2020 at 11:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

If I'm correct, quickdie() doesn't access any shared memory because
one of the reason we can be in quickdie() is when the shared memory
itself is corrupted(the comment down below on why we don't call
roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
have some problem in accessing the shared memory i.e. + return
PMSignalState->sigquit_reason;?

It couldn't really have any problem in physically accessing the field;
we never detach from the main shared memory block. The risk that needs
to be thought about is that shared memory contains garbage --- for
example, imagine that a failing process scribbled in the wrong part of
shared memory before crashing. So the hazard here is that there's a
small chance that sigquit_reason will contain the wrong value, which
would cause the patch to print a misleading message, or more likely
not print anything (since I didn't put a default case in that switch).
That seems fine to me. Also, because the sequence of events would be
(1) failing process scribbles and crashes, (2) postmaster updates
sigquit_reason, (3) other child processes examine sigquit_reason,
it's fairly likely that we'd get the right answer even if the field
got clobbered during (1).

Hmm.

There might be an argument for emitting the "unexpected SIGQUIT"
text if we find garbage in sigquit_reason. Any thoughts about that?

Although I can't think of any case now, IMHO we can still have a
default case(we may or may not hit it) in the switch with a message
something like "terminating connection due to unexpected SIGQUIT".

AFAIK, errmsg(terminating connection due to administrator command") is
emitted when there's no specific reason. But we know exactly why we
are terminating in this case, how about having an error message like
errmsg("terminating connection due to immediate shutdown request")));
? There are other places where errmsg("terminating connection due to
XXXX reasons"); is used. We also log messages when an immediate
shutdown request is received errmsg("received immediate shutdown
request").

+1. I definitely think having this message be different can be useful.

OK, will use "terminating connection due to immediate shutdown
request".

Thanks.

I don't have any further comments on the patch.

See also the thread about tracking shutdown reasons (connection
statistics) -- not the same thing, but the same concepts apply.

Hm. I wondered for a bit if that patch could make use of this one
to improve its results. For the specific case of SIGQUIT it seems
moot because we aren't going to let backends send any shutdown
statistics during an emergency stop.

Yeah.

But maybe the idea could be extended to let more-accurate termination reasons be provided in
some other cases.

Yeah. For instance, the idea can be extended to the following scenario
- currently for smart and fast shutdown postmaster sends single signal
SIGTERM, so the backend can not know what was the exact reason for the
shutdown. If the postmaster updates the sigterm reason, (the way this
patch does, just before signalling children with SIGTERM), then the
backend would know that information and can report better.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#5)
Re: Better client reporting for "immediate stop" shutdowns

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

On Tue, Dec 22, 2020 at 11:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There might be an argument for emitting the "unexpected SIGQUIT"
text if we find garbage in sigquit_reason. Any thoughts about that?

Although I can't think of any case now, IMHO we can still have a
default case(we may or may not hit it) in the switch with a message
something like "terminating connection due to unexpected SIGQUIT".

I don't really want to add a default case just on speculation. We
generally prefer to avoid writing a default in a switch that's supposed
to cover all values of an enum type, because without the default most C
compilers will warn you if you omit a value, whereas with the default
they won't. Admittedly, it's unlikely someone would add a new
QuitSignalReason and forget to update this code, but still it's not
really project style to do it like that. I don't think there's enough
risk here to go against the style.

Hence, pushed it like that.

regards, tom lane

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: Better client reporting for "immediate stop" shutdowns

Hi,

On 2020-12-21 16:43:33 -0500, Tom Lane wrote:

Up to now, if you shut down the database with "pg_ctl stop -m immediate"
then clients get a scary message claiming that something has crashed,
because backends can't tell whether the SIGQUIT they got was sent for
a crash-and-restart situation or because of an immediate-stop command.

+many

This isn't great from a fit-and-finish perspective, and it occurs to me
that it's really easy to do better: the postmaster can stick a flag
into shared memory explaining the reason for SIGQUIT. While we don't
like the postmaster touching shared memory, there doesn't seem to be
any need for interlocking or anything like that, so there is no risk
involved that's greater than those already taken by pmsignal.c.

So, here's a very simple proposed patch. Some issues for possible
bikeshedding:

* Up to now, pmsignal.c has only been for child-to-postmaster
communication, so maybe there is some better place to put the
support code. I can't think of one though.

Seems fine with me.

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

To do better I think we'd have to distinguish the different cases? An
error message like
"terminating connection due to {fast shutdown,immediate shutdown,connection termination} administrator command"
or such could be helpful, but I don't think your patch adds *quite*
enough state?

I'd like to not log all these repeated messages into the server
log. It's quite annoying to have to digg through thousands of lines of
repeated "terminating connection..." lines that add absolutely no
additional information, just because I am shutting down the
server. Similarly, trying to find the reason for a PANIC is often hard
due to all the other messages.

Greetings,

Andres Freund

#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#7)
Re: Better client reporting for "immediate stop" shutdowns

On Sat, Dec 26, 2020 at 4:33 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-12-21 16:43:33 -0500, Tom Lane wrote:

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

To do better I think we'd have to distinguish the different cases? An
error message like
"terminating connection due to {fast shutdown,immediate shutdown,connection termination} administrator command"
or such could be helpful, but I don't think your patch adds *quite*
enough state?

Currently, for fast shutdown, the "FATAL: terminating connection due
to administrator command" message is shown in server logs per backend.
The idea used for immediate shutdown can be extended to fast shutdown
as well, that is postmaster can set the signal state just before
signalling the backends with SIGTERM and later in ProcessInterrupts()
the status can be checked and report something like "FATAL:
terminating connection due to fast shutdown command".

And for smart shutdown, since the postmaster waits until the normal
backends to go away on their own and no FATAL messages get logged, so
we don't need to set the signal state.

I'd like to not log all these repeated messages into the server
log. It's quite annoying to have to digg through thousands of lines of
repeated "terminating connection..." lines that add absolutely no
additional information, just because I am shutting down the
server. Similarly, trying to find the reason for a PANIC is often hard
due to all the other messages.

Currently, only one "terminating connection due to XXXX"
message(WARNING for immediate shutdown, FATAL for fast shutdown) gets
logged in the server logs per backend, so the number of log messages
for each shutdown depends on the number of active backends plus other
bg workers if any. If we don't want to let each active backend to show
up these messages separately, then how about postmaster (as it anyways
knows what are the active backends it currently has) checking if all
the backends have exited properly and showing only one message,
something like "the active backends are terminated due to XXXX"?

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: Better client reporting for "immediate stop" shutdowns

Andres Freund <andres@anarazel.de> writes:

On 2020-12-21 16:43:33 -0500, Tom Lane wrote:

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

[ per upthread, I did already change the SIGQUIT message to specify
"immediate shutdown" ]

To do better I think we'd have to distinguish the different cases? An
error message like
"terminating connection due to {fast shutdown,immediate shutdown,connection termination} administrator command"
or such could be helpful, but I don't think your patch adds *quite*
enough state?

Well, if you want to distinguish different causes for SIGTERM then
you'd need additional mechanism for that. I think we'd have to have
a per-child termination-reason field, since SIGTERM might be sent to
just an individual backend rather than the whole flotilla at once.
I didn't think it was quite worth the trouble --- "administrator command"
seems close enough for both fast shutdown and pg_terminate_backend() ---
but you could certainly argue differently.

I suppose a compromise position could be to let the postmaster export its
"Shutdown" state variable, and then let backends assume that SIGTERM means
fast shutdown or pg_terminate_backend depending on the state of that one
global variable. But it'd be a bit imprecise so I don't really feel it's
more useful than what we have.

I'd like to not log all these repeated messages into the server
log. It's quite annoying to have to digg through thousands of lines of
repeated "terminating connection..."

Hm. That's an orthogonal issue, but certainly worth considering.
There are a couple of levels we could consider:

1. Just make the logged messages less verbose (they certainly don't
need the DETAIL and HINT lines).

2. Suppress the log entries altogether.

I would have been against #2 before this patch, because it'd mean
that a rogue SIGQUIT leaves no clear trace in the log. But with
this patch, we can be fairly sure that we know whether SIGQUIT came
from the postmaster, and then it might be all right to suppress the
log entry altogether when it did.

I'd be happy to write up a patch for either of these, but let's
decide what we want first.

regards, tom lane

#10Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Better client reporting for "immediate stop" shutdowns

Hi,

On 2020-12-26 13:37:15 -0500, Tom Lane wrote:

I suppose a compromise position could be to let the postmaster export its
"Shutdown" state variable, and then let backends assume that SIGTERM means
fast shutdown or pg_terminate_backend depending on the state of that one
global variable. But it'd be a bit imprecise so I don't really feel it's
more useful than what we have.

Fair enough, I think.

I'd like to not log all these repeated messages into the server
log. It's quite annoying to have to digg through thousands of lines of
repeated "terminating connection..."

Hm. That's an orthogonal issue, but certainly worth considering.
There are a couple of levels we could consider:

1. Just make the logged messages less verbose (they certainly don't
need the DETAIL and HINT lines).

2. Suppress the log entries altogether.

I would have been against #2 before this patch, because it'd mean
that a rogue SIGQUIT leaves no clear trace in the log. But with
this patch, we can be fairly sure that we know whether SIGQUIT came
from the postmaster, and then it might be all right to suppress the
log entry altogether when it did.

I'd be happy to write up a patch for either of these, but let's
decide what we want first.

My vote would be #2, with the same reasoning as yours.

Greetings,

Andres Freund

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: Better client reporting for "immediate stop" shutdowns

Andres Freund <andres@anarazel.de> writes:

On 2020-12-26 13:37:15 -0500, Tom Lane wrote:

I'd like to not log all these repeated messages into the server
log. It's quite annoying to have to digg through thousands of lines of
repeated "terminating connection..."

Hm. That's an orthogonal issue, but certainly worth considering.
There are a couple of levels we could consider:
1. Just make the logged messages less verbose (they certainly don't
need the DETAIL and HINT lines).
2. Suppress the log entries altogether.

My vote would be #2, with the same reasoning as yours.

The most straightforward way to do that is to introduce a new error
level. Having to renumber existing levels is a bit of a pain, but
I'm not aware of anything that should break in source-code terms.
We make similar ABI breaks in every major release.

regards, tom lane

Attachments:

suppress-log-spam-for-SIGQUIT-1.patchtext/x-diff; charset=us-ascii; name=suppress-log-spam-for-SIGQUIT-1.patchDownload+21-9
#12Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: Better client reporting for "immediate stop" shutdowns

Hi,

On 2020-12-28 13:25:14 -0500, Tom Lane wrote:

The most straightforward way to do that is to introduce a new error
level. Having to renumber existing levels is a bit of a pain, but
I'm not aware of anything that should break in source-code terms.
We make similar ABI breaks in every major release.

I don't see a problem either.

/* Select default errcode based on elevel */
if (elevel >= ERROR)
edata->sqlerrcode = ERRCODE_INTERNAL_ERROR;
- else if (elevel == WARNING)
+ else if (elevel >= WARNING)
edata->sqlerrcode = ERRCODE_WARNING;
else
edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;

@@ -2152,6 +2157,7 @@ write_eventlog(int level, const char *line, int len)
eventlevel = EVENTLOG_INFORMATION_TYPE;
break;
case WARNING:
+ case WARNING_CLIENT_ONLY:
eventlevel = EVENTLOG_WARNING_TYPE;
break;
case ERROR:
[...]

I don't think it needs to be done right now, but I again want to suggest
it'd be nice if we split log levels into a bitmask. If we bits, separate
from the log level, for do-not-log-to-client and do-not-log-to-server
some of this code would imo look nicer.

Looks good to me.

Greetings,

Andres Freund

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Better client reporting for "immediate stop" shutdowns

Andres Freund <andres@anarazel.de> writes:

I don't think it needs to be done right now, but I again want to suggest
it'd be nice if we split log levels into a bitmask. If we bits, separate
from the log level, for do-not-log-to-client and do-not-log-to-server
some of this code would imo look nicer.

Hmm, maybe. I agree that would be better done as a separate patch though.

I had a thought while looking at elog.c: we could further reduce the risk
of quickdie() crashing if we make it do what elog.c does when it gets into
error recursion trouble:

error_context_stack = NULL;
debug_query_string = NULL;

Not invoking error context callbacks would significantly reduce the
footprint of code that can be reached from quickdie's ereports, and
the current call stack isn't really relevant to a report of SIGQUIT
anyway. The argument for not reporting debug_query_string is a little
thinner, but if that string is long it could result in additional
palloc work inside elog.c, thus increasing the amount of stuff that
has to work to get the report out.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: Better client reporting for "immediate stop" shutdowns

I wrote:

I had a thought while looking at elog.c: we could further reduce the risk
of quickdie() crashing if we make it do what elog.c does when it gets into
error recursion trouble:
error_context_stack = NULL;
debug_query_string = NULL;

On closer inspection, there's not much need to touch debug_query_string
here, because elog.c only consults that for making log entries, which
we're suppressing. I pushed it with just the error_context_stack reset.

regards, tom lane