Why does logical replication launcher exit with exit code 1?

Started by Thomas Munroover 8 years ago20 messages
#1Thomas Munro
thomas.munro@enterprisedb.com

Hi,

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#1)
Re: Why does logical replication launcher exit with exit code 1?

Hi,

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Andres Freund (#2)
Re: Why does logical replication launcher exit with exit code 1?

On 02/08/17 11:03, Andres Freund wrote:

Hi,

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

- Andres

Returning zero to indicate success is a holdover to the time computers
could only run one program at a time. At the end of the code there was
a jump table of 4 byte entries. The first entry with a displacement of
zero was the location to jump to for a normal exit, subsequent entries
where for various error conditions. This why often return codes where
in positive multiples of 4, since we don't use jump tables now - more &
more people are using any integers they want.

So apart from convention, returning zero is no longer held to be a
sacred to indicate something exited okay. In fact since, zero could
simply mean a value was not explicitly assigned, hence it is actually a
very dangerous value to be used to indicate success!

Cheers,
Gavin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Andres Freund
andres@anarazel.de
In reply to: Gavin Flower (#3)
Re: Why does logical replication launcher exit with exit code 1?

On 2017-08-02 11:19:39 +1200, Gavin Flower wrote:

Returning zero to indicate success is a holdover to the time computers could
only run one program at a time. At the end of the code there was a jump
table of 4 byte entries. The first entry with a displacement of zero was
the location to jump to for a normal exit, subsequent entries where for
various error conditions. This why often return codes where in positive
multiples of 4, since we don't use jump tables now - more & more people are
using any integers they want.

So apart from convention, returning zero is no longer held to be a sacred to
indicate something exited okay. In fact since, zero could simply mean a
value was not explicitly assigned, hence it is actually a very dangerous
value to be used to indicate success!

This has nothing to do with this thread.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#2)
Re: Why does logical replication launcher exit with exit code 1?

On Wed, Aug 2, 2017 at 11:03 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

I see. In the meantime IMHO I think we should try to find a way to
avoid printing out this message -- it looks like something is wrong to
the uninitiated.

Possibly stupid question: why do we restart workers when we know we're
shutting down anyway? Hmm, I suppose there might conceivably be
workers that need to do something during shutdown and they might not
have done it yet.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#5)
Re: Why does logical replication launcher exit with exit code 1?

On 2017-08-02 12:14:18 +1200, Thomas Munro wrote:

On Wed, Aug 2, 2017 at 11:03 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

I see. In the meantime IMHO I think we should try to find a way to
avoid printing out this message -- it looks like something is wrong to
the uninitiated.

Well, that's how it is for all bgworkers - maybe a better solution is to
adjust that message in the postmaster rather than fiddle with the worker
exist code? Seems like we could easily take pmStatus into account
inside LogChildExit() and set the log level to DEBUG1 even for
EXIT_STATUS_1 in that case? Additionally we probably should always log
a better message for bgworkers exiting with exit 1, something about
unregistering the worker or such.

Possibly stupid question: why do we restart workers when we know we're
shutting down anyway? Hmm, I suppose there might conceivably be
workers that need to do something during shutdown and they might not
have done it yet.

The launcher doesn't really know the reason for the shutdown.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: Why does logical replication launcher exit with exit code 1?

On Tue, Aug 1, 2017 at 7:03 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

You might've missed commit be7558162acc5578d0b2cf0c8d4c76b6076ce352.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: Why does logical replication launcher exit with exit code 1?

On 2017-08-01 20:37:07 -0400, Robert Haas wrote:

On Tue, Aug 1, 2017 at 7:03 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:

When I shut down a cluster that isn't using logical replication, it
always logs a line like the following. So do the build farm members I
looked at. I didn't see anything about this in the open items list --
isn't it a bug?

2017-08-02 10:39:25.007 NZST [34781] LOG: worker process: logical
replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that. I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

You might've missed commit be7558162acc5578d0b2cf0c8d4c76b6076ce352.

Not really, just thinko-ing it. We don't want to unregister, so we can't
return 0, IOW, I just * -1'd my comment ;)

We intentionally return 1, so we *do* get restarted:
else if (IsLogicalLauncher())
{
ereport(DEBUG1,
(errmsg("logical replication launcher shutting down")));

/*
* The logical replication launcher can be stopped at any time.
* Use exit status 1 so the background worker is restarted.
*/
proc_exit(1);
}

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Andres Freund (#6)
Re: Why does logical replication launcher exit with exit code 1?

On 8/1/17 20:20, Andres Freund wrote:

Well, that's how it is for all bgworkers - maybe a better solution is to
adjust that message in the postmaster rather than fiddle with the worker
exist code? Seems like we could easily take pmStatus into account
inside LogChildExit() and set the log level to DEBUG1 even for
EXIT_STATUS_1 in that case? Additionally we probably should always log
a better message for bgworkers exiting with exit 1, something about
unregistering the worker or such.

Maybe it doesn't need to be logged at all (other than perhaps as DEBUG)?
A few months ago, people were complaining about too many messages about
background workers starting. Now we are having complaints about
messages about background workers stopping.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#9)
Re: Why does logical replication launcher exit with exit code 1?

On Wed, Aug 2, 2017 at 4:38 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Maybe it doesn't need to be logged at all (other than perhaps as DEBUG)?
A few months ago, people were complaining about too many messages about
background workers starting. Now we are having complaints about
messages about background workers stopping.

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop. That's stuff somebody might want
to know. It's not going to generate a lot of log volume, and it might
be useful, so why suppress it?

Where things get ugly is if you start to get a high rate of messages -
e.g. from starting and stopping parallel query workers or other kinds
of things where you might have workers starting and stopping very
frequently. But surely this isn't an example of that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#10)
Re: Why does logical replication launcher exit with exit code 1?

On 2017-08-02 16:52:01 -0400, Robert Haas wrote:

On Wed, Aug 2, 2017 at 4:38 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Maybe it doesn't need to be logged at all (other than perhaps as DEBUG)?
A few months ago, people were complaining about too many messages about
background workers starting. Now we are having complaints about
messages about background workers stopping.

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop. That's stuff somebody might want
to know. It's not going to generate a lot of log volume, and it might
be useful, so why suppress it?

Where things get ugly is if you start to get a high rate of messages -
e.g. from starting and stopping parallel query workers or other kinds
of things where you might have workers starting and stopping very
frequently. But surely this isn't an example of that.

I generally agree. But in the shutdown case it's just useless and
confusing - the launcher is stopping because the entire server is being
stopped, and that's very much not clear from the message.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: Why does logical replication launcher exit with exit code 1?

Andres Freund <andres@anarazel.de> writes:

On 2017-08-02 16:52:01 -0400, Robert Haas wrote:

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop. That's stuff somebody might want
to know. It's not going to generate a lot of log volume, and it might
be useful, so why suppress it?

I generally agree. But in the shutdown case it's just useless and
confusing - the launcher is stopping because the entire server is being
stopped, and that's very much not clear from the message.

Yes. The main complaint here is not about the existence of the message
but about the fact that it looks like it's reporting a failure. I would
vote for removing it if it's not simple to fix that problem.

Even if we fix that, though, I think it is reasonable to downgrade it to
DEBUG1. We did that already for other standard background processes such
as the autovac launcher, and it's not apparent to me why the logrep
launcher should be chattier. Now, *unexpected* starts or stops should
certainly deserve a higher log rating ... but the general rule ought to be
that totally-expected behavior does not deserve a log entry by default.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: Why does logical replication launcher exit with exit code 1?

On 2017-08-02 17:09:10 -0400, Tom Lane wrote:

Even if we fix that, though, I think it is reasonable to downgrade it to
DEBUG1. We did that already for other standard background processes such
as the autovac launcher, and it's not apparent to me why the logrep
launcher should be chattier. Now, *unexpected* starts or stops should
certainly deserve a higher log rating ... but the general rule ought to be
that totally-expected behavior does not deserve a log entry by default.

Well, every exit *other* than during a shutdown is unexpected. That's
why I suggested changing the log level for exit code 1 depending on the
cluster being shut down or not.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Robert Haas (#10)
Re: Why does logical replication launcher exit with exit code 1?

On 8/2/17 16:52, Robert Haas wrote:

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop.

But we got rid of those start messages recently due to complaints.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#14)
Re: Why does logical replication launcher exit with exit code 1?

On Wed, Aug 2, 2017 at 9:33 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 8/2/17 16:52, Robert Haas wrote:

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop.

But we got rid of those start messages recently due to complaints.

Yeah, true. I'm just talking about what *I* think. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#15)
Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

On Thu, Aug 3, 2017 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 2, 2017 at 9:33 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 8/2/17 16:52, Robert Haas wrote:

I actually don't think it's that unreasonable to get notified when
system-wide processes like the autovacuum launcher or the logical
replication launcher start or stop.

But we got rid of those start messages recently due to complaints.

Yeah, true. I'm just talking about what *I* think. :-)

I still think the current situation is non-ideal. I don't have a
strong view on whether some or all system-wide processes should say
hello and goodbye explicitly in the log, but I do think we need a way
to make that not look like an error condition, and ideally without
special cases for known built-in processes.

I looked into this a bit today, while debugging an extension that runs
a background worker. Here are some assorted approaches to shutdown:

0. The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

1. Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2. Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop. By
default this looks like "FATAL: terminating connection due to
administrator command". This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop. For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

3. Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour. This includes the logical replication launcher.

A couple of thoughts:

* Extensions that need to use die() (or something else that would be
compatible with CFI() wait loops) should ideally have a way to control
how ProcessInterrupts() reports this to the user, since shutdown is an
expected condition for long-lived processes.

* We really should get rid of that "exited with exit code 1".

--
Thomas Munro
http://www.enterprisedb.com

#17Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#1)
Fwd: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I still think the current situation is non-ideal. I don't have a
strong view on whether some or all system-wide processes should say
hello and goodbye explicitly in the log, but I do think we need a way
to make that not look like an error condition, and ideally without
special cases for known built-in processes.

I looked into this a bit today, while debugging an extension that runs
a background worker. Here are some assorted approaches to shutdown:

0. The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

Yeah, I think that code is bad news, for the same reasons discussed
with regard to quickdie().

1. Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2. Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop. By
default this looks like "FATAL: terminating connection due to
administrator command". This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop. For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

My experience with background workers has been that approach #1 does
not really work. I mean, if you aren't doing anything complicated it
might be OK, if for example you are executing SQL queries, and you try
to do #1, then your SQL queries don't respond to interrupts. And that
sucks. So I have generally adopted approach #2.

To put that another way, nearly everything can reach
CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
in the case isn't much different from saying that it is required, full
stop.

3. Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour. This includes the logical replication launcher.

Maybe we could replace this by a general-purpose hook. So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
(*procdie_hook)();

And that hook can do whatever you like (but probably including dying).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#16)
Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

On Fri, Oct 5, 2018 at 12:36 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

* We really should get rid of that "exited with exit code 1".

Robert and I just discussed this subproblem (the original complaint of
this thread) off-list. Our questions are: does anyone actually want
that message from the postmaster in the log, and if not, shouldn't we
just do this?

diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 41de140ae0..b34655b4bd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3193,8 +3193,7 @@ CleanupBackgroundWorker(int pid,
                rw->rw_child_slot = 0;
                ReportBackgroundWorkerExit(&iter);      /* report child death */
-               LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG,
-                                        namebuf, pid, exitstatus);
+               LogChildExit(DEBUG1, namebuf, pid, exitstatus);

return true;
}

As for the problem of the behaviour of bgworker processes themselves
on SIGTERM, let's discuss that separately in the other subthread[1]/messages/by-id/CA+TgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_+X=w@mail.gmail.com
(well, my mail client thinks it's a different thread, but the archives
think it's the same thread with a different subject).

[1]: /messages/by-id/CA+TgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_+X=w@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#19Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#17)
2 attachment(s)
Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

On Wed, Oct 10, 2018 at 7:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

0. The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

Yeah, I think that code is bad news, for the same reasons discussed
with regard to quickdie().

So I suppose we should just remove it, with something like 0002. I'm
a bit uneasy about existing code out there that might be not calling
CFI. OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

1. Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2. Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop. By
default this looks like "FATAL: terminating connection due to
administrator command". This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop. For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

My experience with background workers has been that approach #1 does
not really work. I mean, if you aren't doing anything complicated it
might be OK, if for example you are executing SQL queries, and you try
to do #1, then your SQL queries don't respond to interrupts. And that
sucks. So I have generally adopted approach #2.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)? I think I might have been
guilty of copying that.

To put that another way, nearly everything can reach
CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
in the case isn't much different from saying that it is required, full
stop.

3. Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour. This includes the logical replication launcher.

Maybe we could replace this by a general-purpose hook. So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
(*procdie_hook)();

And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Add-proc_die_hook-to-customize-die-interrupt-handlin.patchapplication/octet-stream; name=0001-Add-proc_die_hook-to-customize-die-interrupt-handlin.patchDownload
From 591a65113fb94b9d8335a0f0fbc6ab1349b68511 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 10 Oct 2018 13:20:54 +1300
Subject: [PATCH 1/2] Add proc_die_hook to customize die() interrupt handling.

Background workers using the standard die() SIGTERM handler currently
exit with a FATAL error message at their next CHECK_FOR_INTERRUPTS().
Provide a way for any backend to install an alternative handler.

Provide a function proc_die_quietly() that can be installed to exit
without raising an error in that case.

Author: Thomas Munro, based on a suggestion from Robert Haas
Discussion: https://postgr.es/m/CA%2BTgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_%2BX%3Dw%40mail.gmail.com
---
 src/backend/tcop/postgres.c | 29 +++++++++++++++++++++++++----
 src/include/tcop/tcopprot.h |  2 ++
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e4c6e3d406e..743f8a89a58 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -100,6 +100,9 @@ int			max_stack_depth = 100;
 /* wait N seconds to allow attach from a debugger */
 int			PostAuthDelay = 0;
 
+/* Hook for custom proc die handler. */
+void		(*proc_die_hook)(void);
+
 
 
 /* ----------------
@@ -2989,10 +2992,17 @@ ProcessInterrupts(void)
 					 errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
-		else
-			ereport(FATAL,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("terminating connection due to administrator command")));
+		else if (proc_die_hook)
+		{
+			/*
+			 * Custom proc die callback, which is expected to call proc_exit()
+			 * or ereport(FATAL, ...).
+			 */
+			(*proc_die_hook)();
+		}
+		ereport(FATAL,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("terminating connection due to administrator command")));
 	}
 	if (ClientConnectionLost)
 	{
@@ -3125,6 +3135,17 @@ ProcessInterrupts(void)
 		HandleParallelMessages();
 }
 
+/*
+ * A function installable as proc_die_hook, for backends such as bgworkers
+ * that wish to exit from ProcessInterrupts() quietly after SIGTERM is
+ * received by the standard die() handler.
+ */
+void
+proc_die_quietly(void)
+{
+	proc_exit(0);
+}
+
 
 /*
  * IA64-specific code to fetch the AR.BSP register for stack depth checks.
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 63b4e4864d0..827a0f19dd4 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -34,6 +34,7 @@ extern CommandDest whereToSendOutput;
 extern PGDLLIMPORT const char *debug_query_string;
 extern int	max_stack_depth;
 extern int	PostAuthDelay;
+extern void (*proc_die_hook)(void);
 
 /* GUC-configurable parameters */
 
@@ -65,6 +66,7 @@ extern List *pg_plan_queries(List *querytrees, int cursorOptions,
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
 extern void assign_max_stack_depth(int newval, void *extra);
 
+extern void proc_die_quietly(void);
 extern void die(SIGNAL_ARGS);
 extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn();
 extern void StatementCancelHandler(SIGNAL_ARGS);
-- 
2.17.1 (Apple Git-112)

0002-Remove-async-signal-unsafe-bgworker_die-function.patchapplication/octet-stream; name=0002-Remove-async-signal-unsafe-bgworker_die-function.patchDownload
From eebbba4924a498a8887259ce1565059f8661e117 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 10 Oct 2018 14:50:53 +1300
Subject: [PATCH 2/2] Remove async-signal-unsafe bgworker_die() function.

The default SIGTERM handler for background workers called ereport()
directly.  Hopefully background workers typically install their own
handler, but for the benefit of those that don't, let's install the
standard die() handler instead.  That defers handling until the next
CHECK_FOR_INTERRUPTS().

Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BTgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_%2BX%3Dw%40mail.gmail.com
---
 src/backend/postmaster/bgworker.c | 16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d2b695e1462..13f48244e79 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -661,20 +661,6 @@ bgworker_quickdie(SIGNAL_ARGS)
 	_exit(2);
 }
 
-/*
- * Standard SIGTERM handler for background workers
- */
-static void
-bgworker_die(SIGNAL_ARGS)
-{
-	PG_SETMASK(&BlockSig);
-
-	ereport(FATAL,
-			(errcode(ERRCODE_ADMIN_SHUTDOWN),
-			 errmsg("terminating background worker \"%s\" due to administrator command",
-					MyBgworkerEntry->bgw_type)));
-}
-
 /*
  * Standard SIGUSR1 handler for unconnected workers
  *
@@ -751,7 +737,7 @@ StartBackgroundWorker(void)
 		pqsignal(SIGUSR1, bgworker_sigusr1_handler);
 		pqsignal(SIGFPE, SIG_IGN);
 	}
-	pqsignal(SIGTERM, bgworker_die);
+	pqsignal(SIGTERM, die);		/* die at next CFI */
 	pqsignal(SIGHUP, SIG_IGN);
 
 	pqsignal(SIGQUIT, bgworker_quickdie);
-- 
2.17.1 (Apple Git-112)

#20Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#19)
Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

On Wed, Oct 10, 2018 at 12:00 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

So I suppose we should just remove it, with something like 0002. I'm
a bit uneasy about existing code out there that might be not calling
CFI. OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

I agree -- I think worker_spi.c has probably been copied a lot, and
that's not a good thing.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)? I think I might have been
guilty of copying that.

+1. It's not *really* OK unless all of the SQL queries it executes
are super-short and can't possibly block ... which I guess is never
really true of any SQL queries, right?

Maybe we could replace this by a general-purpose hook. So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
(*procdie_hook)();

And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

I was assuming that we'd replace the existing message-selection logic
with customer proc-die handlers. Also, I think we should not indicate
in the comments that the handler is expected to proc_exit() or
ereport(FATAL), because you might want to do something else there,
then return and let the usual thing happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company