Parallel worker hangs while handling errors.

Started by vignesh Calmost 6 years ago28 messageshackers
Jump to latest
#1vignesh C
vignesh21@gmail.com

Hi,

Parallel worker hangs while handling errors.

Analysis:
When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine. If there
is a bigger error message, the worker process will wait for the leader
process to read the message, free up some memory in shared memory and
set the latch. The worker will be waiting at the below back trace:
#4 0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454,
wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5 0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350
"SERROR", len=230015) at pqmq.c:171
#6 0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7 0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0
<errordata>) at elog.c:3327
#8 0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.
I could reproduce this issue when there is an error during copy of
toast data using parallel copy, this project is an in-progress
project. I don't have a test case to reproduce on the head. Any
suggestions for a test case on head?
The Attached patch has the fix for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchapplication/x-patch; name=0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchDownload+5-1
#2vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#1)
Re: Parallel worker hangs while handling errors.

On Fri, Jul 3, 2020 at 2:40 PM vignesh C <vignesh21@gmail.com> wrote:

The Attached patch has the fix for the same.

I have added a commitfest entry for this bug:
https://commitfest.postgresql.org/29/2636/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#3Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#1)
Re: Parallel worker hangs while handling errors.

Parallel worker hangs while handling errors.

When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine.

I reproduced the hang issue with the parallel copy patches[1]/messages/by-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w@mail.gmail.com. The use
case is as follows - one of the parallel workers tries to report error
to the leader process and as part of the error context it also tries
to send the entire row/tuple data(which is a lot more than 16KB).

The fix provided here solves the above problem, i.e. no hang occurs,
and the entire tuple/row data in the error from worker to leader gets
transferred, see the attachment "testcase.text" for the output.

Apart from that, I also executed the regression tests (make check and
make check-world) on the patch, no issues are observed.

[1]: /messages/by-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w@mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

testcase.txttext/plain; charset=US-ASCII; name=testcase.txtDownload
#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#1)
Re: Parallel worker hangs while handling errors.

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

Your analysis looks fine to me.

Adding some more info:

The worker uses SIGUSR1 (with a special shared memory flag
PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from
mq_putmessage) and for parallel worker shutdown(from
ParallelWorkerShutdown).

And yes, the postmaster blocks SIGUSR1 before forking bgworker(in
PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker
receives the same blocked signal mask for itself and enters
StartBackgroundWorker(), uses sigsetjmp for error handling, and then
goes to ParallelWorkerMain() where few of the signal handers are set
and then BackgroundWorkerUnblockSignals() is called to not block any
signals.

But observe when we did sigsetjmp the state of the signal mask is that
of we received from postmaster which is all signals blocked.

So, now in error cases when the control is jumped to sigsetjmp we
still have the signals blocked(along with SIGUSR1) mask and in the
code path of EmitErrorReport, we do send SIGUSR1 with flag
PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch
to be set, this happens only if the worker is able to receive back
SIGUSR1 from the leader/backend.

In this reported issue, since SIGUSR1 is blocked at sigsetjmp in
StartBackgroundWorker(), there is no way that the worker process
receiving it from the leader and the latch cannot be set and hence the
hang occurs.

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

Something like below -

if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
sigset_t x;
sigemptyset (&x);
sigaddset(&x, SIGUSR1);
sigprocmask(SIG_UNBLOCK, &x, NULL);

/* Since not using PG_TRY, must reset error stack by hand */
error_context_stack = NULL;

/* Prevent interrupts while cleaning up */
HOLD_INTERRUPTS();

If okay, with the above approach, we can put the above
sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to
unblock any given signal) in a macro similar to PG_SETMASK() and use
that in all the places wherever EmitErrorReport() is called from
sigsetjmp. We should mind the portability of sigprocmask as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#5vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#4)
Re: Parallel worker hangs while handling errors.

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchDownload+11-1
#6Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#5)
Re: Parallel worker hangs while handling errors.

On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()).

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I verified this fix for the parallel copy hang issue. And also make
check and make check-world passes.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchapplication/x-patch; name=v2-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload+26-1
#7vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: Parallel worker hangs while handling errors.

On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()).

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload+35-1
#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#7)
Re: Parallel worker hangs while handling errors.

On Sat, Jul 25, 2020 at 7:02 AM vignesh C <vignesh21@gmail.com> wrote:

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Having a function to unblock selective signals for a bg worker looks good to me.

Few comments:
1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?
2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().
I slightly modified your function, something like below?

void
BackgroundWorkerUpdateSignalMask(int signum, bool toblock)
{
if (toblock)
sigaddset(&BlockSig, signum);
else
sigdelset(&BlockSig, signum);

PG_SETMASK(&BlockSig);
}

/*to unblock SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerUpdateSignalMask(SIGUSR1, false);

/*to block SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerUpdateSignalMask(SIGUSR1, true);

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#9vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#8)
Re: Parallel worker hangs while handling errors.

Thanks for your comments Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?

Fixed, I have moved the worker check to the caller function.

2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to break into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This maintains the consistency similar to
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload+46-2
#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#9)
Re: Parallel worker hangs while handling errors.

On Tue, Jul 28, 2020 at 11:05 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for your comments Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?

Fixed, I have moved the worker check to the caller function.

2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to break into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This maintains the consistency similar to
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#11Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#10)
Re: Parallel worker hangs while handling errors.

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.
Two, SIGUSR1 is used for a LOT of things: look at all the different
cases procsignal_sigusr1_handler() checks. If the intention is to only
allow the things we know are safe, rather than all the signals there
are, I think this coding utterly fails to achieve that - and for
reasons that I don't think are really fixable.

My first idea about how to fix this was just to call
BackgroundWorkerUnblockSignals() before sigsetjmp(), but that doesn't
really work, because ParallelWorkerMain() needs to set the handler for
SIGTERM before unblocking signals. When you really look at it, the
code that does sigsetjmp() in StartBackgroundWorker() is entirely
bogus. The comment says "See notes in postgres.c about the design of
this coding," but if you go read that comment, it says that the point
of using sigsetjmp() is to make sure that signals are unblocked within
the if-block that follows, but the use in bgworker.c actually achieves
exactly the opposite, because signals have not yet been unblocked at
this point. So, whereas the postgres.c code unblocks signals if they
are blocked, this code blocks signals if they are unblocked. Given
that, maybe the right thing to do is to just start the if-block with a
call to BackgroundWorkerUnblockSignals(). Perhaps there's some reason
that would be unsafe, if the failure occurs too early: postgres.c
doesn't unblock signals until after BaseInit() and InitProcess() have
been called, but here an error in those functions would unblock
signals while it's being handled. Off-hand, I don't see why that would
matter, though. In the postgres.c case, there wouldn't be a
PG_exception_stack yet, so we'd end up in the long part of
pg_re_throw(), which basically promotes the ERROR to FATAL but
otherwise does pretty similar things to what this handler does anyway.
So I'm not really sure there's any reason not to just go this way.

Another approach would be to establish a new PG_exception_stack() with
a free sigsetjmp() call and fresh local buffer, inside
ParallelWorkerMain(). It would do the same thing as the existing
handler, but because it would be established after unblocking signals,
sigsetjmp() would behave as desired. This doesn't seem quite as good
to me because I think this pattern might end up getting copied into
many background workers, and it's a bunch of extra code for which I
don't really see a clear need. So at the moment I think a one line fix
in StartBackgroundWorker(), to just unblock signals while handling
errors, looks better.

Adding Alvaro to the CC line, since I think he wrote this code
originally. Not sure if he or anyone else might have an opinion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#11)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}, if postmaster sends a SIGQUIT while these processes are
doing clean up tasks in sigsetjmp(), it will not be received, and the
postmaster later sends SIGKLL to kill from below places.

/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
* normally, but under certain conditions backends can get stuck while
* shutting down. This is a last measure to get them unwedged.
*
* Note we also do this during recovery from a process crash.
*/
if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
AbortStartTime != 0 &&
(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
{
/* We were gentle with them before. Not anymore */
TerminateChildren(SIGKILL);
/* reset flag so we don't SIGKILL again */
AbortStartTime = 0;
}

Shouldn't we call PG_SETMASK(&BlockSig); to make it effective?

Am I missing anything here?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#13Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet. You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery. But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone. Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit. It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety. I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed). I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

regards, tom lane

Attachments:

fix-parallel-worker-hang-v5.patchtext/x-diff; charset=us-ascii; name=fix-parallel-worker-hang-v5.patchDownload+8-1
#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other. It could be correct to interpret the
sigdelset call as preparatory work for a future call to
PG_SETMASK(&BlockSig), but AFAICS there are no such calls in the
processes where this incantation exists, so really it just seems to be
a no-op. Furthermore, the comment says that the point of the
sigdelset() is to allow SIGQUIT "at all times," which doesn't square
well with your suggestion that we intended it to take effect only
later.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet. You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery. But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone. Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit. It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

The SIGQUIT handler in question contains nothing than a call to
_exit(2) and a long comment explaining why we don't do anything else,
so I think the argument that it has no state dependencies is pretty
well water-tight. Whether or not we've got a problem with timely
SIGQUIT acceptance is much less clear. So it seems to me that the
safer thing to do here would be to unblock the signal. It might gain
something, and it can't really lose anything. Now it's true that the
calculus might change if someone were to modify the behavior of the
SIGQUIT handler in the future, but if they do then it's their job to
think about this stuff. It doesn't seem especially likely for that to
change, anyway. The only reason that the handler for regular backends
does anything other than _exit(2) is that we want to try to let the
client know what happened before we croak, and that concern is
irrelevant for background workers. Doing any other cleanup here is
unsafe and unnecessary.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety. I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed). I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

I am not sure whether the difference between this and v1 matters,
because in postgres.c it's effectively happening inside sigsetjmp, so
the earlier unblock must not be that bad. But I don't mind putting it
in the place you suggest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is. Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

regards, tom lane

#17Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

Really? Have you reversed your policy of wanting the comments to
accurately describe what the code does?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked. You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery. I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

regards, tom lane

#19Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 2:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked.

I mean, we're just repeating the same points here, but that's not what
the comment says.

You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not,

It's easily checked using 'git grep'.

but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I can't speak to the intent, but I can speak to what the comment says.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery.

I think it's entirely clear that they do not, and I have explained my
reasoning already.

I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

I've also already explained why I don't agree with this perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#18)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 11:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked. You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery. I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

I think the main point that needs to be thought is that: will any of
the bgwriter, checkpointer, walwriter and walreceiver processes need
to unblock SIGQUIT during their error recovery code paths i.e. in
their respective if (sigsetjmp(local_sigjmp_buf, 1) != 0){....}
stanzas? Currently, SIGQUIT is blocked in the sigsetjmp() stanza.

If the answer is yes: then we must do PG_SETMASK(&BlockSig); :either
right after sigdelset(&BlockSig, SIGQUIT); to allow quickdie() even
before the sigsetjmp() stanza and also in the sigsetjmp() stanza or do
PG_SETMASK(&BlockSig); only inside the sigsetjmp() stanza. The
postmaster sends SIGQUIT in immediate shutdown mode and it gives
children a chance to exit safely, but if the children take longer
time, then it anyways kills them with SIGKILL(note that SIGKILL can
not be handled or ignored by any process).

If the answer is no: let these processes perform clean ups in their
respective sigsetjmp() stanzas, until the postmaster sends SIGKILL if
the clean ups take time. We could have some elaborated comments before
sigdelset(&BlockSig, SIGQUIT); instead of "/* We allow SIGQUIT
(quickdie) at all times */" to avoid confusion.

We must not worry about blocking or unblocking SIGQUIT in these
processes after the sigsetjmp() stanza, as it anyways gets unblocked
by PG_SETMASK(&UnBlockSig); and also no problem if somebody does
PG_SETMASK(&BlockSig); in future as we have already done
sigdelset(&BlockSig, SIGQUIT);.

Can we start a separate thread to discuss this SIGQUIT point to not
sidetrack the main issue "Parallel worker hangs while handling
errors."?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#21vignesh C
vignesh21@gmail.com
In reply to: Robert Haas (#11)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)