Parallel worker hangs while handling errors.

Started by vignesh Cover 5 years ago28 messages
#1vignesh C
vignesh21@gmail.com
1 attachment(s)

Hi,

Parallel worker hangs while handling errors.

Analysis:
When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine. If there
is a bigger error message, the worker process will wait for the leader
process to read the message, free up some memory in shared memory and
set the latch. The worker will be waiting at the below back trace:
#4 0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454,
wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5 0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350
"SERROR", len=230015) at pqmq.c:171
#6 0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7 0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0
<errordata>) at elog.c:3327
#8 0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.
I could reproduce this issue when there is an error during copy of
toast data using parallel copy, this project is an in-progress
project. I don't have a test case to reproduce on the head. Any
suggestions for a test case on head?
The Attached patch has the fix for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchapplication/x-patch; name=0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchDownload
From a2177fdb0896160f209db4eebc5b4d80eb341e42 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 3 Jul 2020 12:18:55 +0530
Subject: [PATCH] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..9663907 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,11 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * Unblock signals (they were blocked when the postmaster forked us)
+		 */
+		BackgroundWorkerUnblockSignals();
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
-- 
1.8.3.1

#2vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#1)
Re: Parallel worker hangs while handling errors.

On Fri, Jul 3, 2020 at 2:40 PM vignesh C <vignesh21@gmail.com> wrote:

The Attached patch has the fix for the same.

I have added a commitfest entry for this bug:
https://commitfest.postgresql.org/29/2636/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#3Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#1)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

Parallel worker hangs while handling errors.

When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine.

I reproduced the hang issue with the parallel copy patches[1]/messages/by-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w@mail.gmail.com. The use
case is as follows - one of the parallel workers tries to report error
to the leader process and as part of the error context it also tries
to send the entire row/tuple data(which is a lot more than 16KB).

The fix provided here solves the above problem, i.e. no hang occurs,
and the entire tuple/row data in the error from worker to leader gets
transferred, see the attachment "testcase.text" for the output.

Apart from that, I also executed the regression tests (make check and
make check-world) on the patch, no issues are observed.

[1]: /messages/by-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w@mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

testcase.txttext/plain; charset=US-ASCII; name=testcase.txtDownload
#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#1)
Re: Parallel worker hangs while handling errors.

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

Your analysis looks fine to me.

Adding some more info:

The worker uses SIGUSR1 (with a special shared memory flag
PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from
mq_putmessage) and for parallel worker shutdown(from
ParallelWorkerShutdown).

And yes, the postmaster blocks SIGUSR1 before forking bgworker(in
PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker
receives the same blocked signal mask for itself and enters
StartBackgroundWorker(), uses sigsetjmp for error handling, and then
goes to ParallelWorkerMain() where few of the signal handers are set
and then BackgroundWorkerUnblockSignals() is called to not block any
signals.

But observe when we did sigsetjmp the state of the signal mask is that
of we received from postmaster which is all signals blocked.

So, now in error cases when the control is jumped to sigsetjmp we
still have the signals blocked(along with SIGUSR1) mask and in the
code path of EmitErrorReport, we do send SIGUSR1 with flag
PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch
to be set, this happens only if the worker is able to receive back
SIGUSR1 from the leader/backend.

In this reported issue, since SIGUSR1 is blocked at sigsetjmp in
StartBackgroundWorker(), there is no way that the worker process
receiving it from the leader and the latch cannot be set and hence the
hang occurs.

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

Something like below -

if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
sigset_t x;
sigemptyset (&x);
sigaddset(&x, SIGUSR1);
sigprocmask(SIG_UNBLOCK, &x, NULL);

/* Since not using PG_TRY, must reset error stack by hand */
error_context_stack = NULL;

/* Prevent interrupts while cleaning up */
HOLD_INTERRUPTS();

If okay, with the above approach, we can put the above
sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to
unblock any given signal) in a macro similar to PG_SETMASK() and use
that in all the places wherever EmitErrorReport() is called from
sigsetjmp. We should mind the portability of sigprocmask as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#5vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#4)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patchDownload
From b283bdd47e3bc132c1b8ef0ea59d1c12cb8c0ffb Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 23 Jul 2020 12:35:45 +0530
Subject: [PATCH] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..a0af580 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,17 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * Unblock SIGUSR1 signal, it was blocked when the postmaster forker us.
+		 * In case of parallel workers, leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		sigdelset(&BlockSig, SIGUSR1);
+		PG_SETMASK(&BlockSig);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
-- 
1.8.3.1

#6Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#5)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()).

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I verified this fix for the parallel copy hang issue. And also make
check and make check-world passes.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchapplication/x-patch; name=v2-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload
From 6bd924f2e7ff90d6c293f131b8ddb20898b9950d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddy@enterprisedb.com>
Date: Fri, 24 Jul 2020 12:01:14 +0530
Subject: [PATCH v2] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..ee187689a4 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,20 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock SIGUSR1 signal, it was blocked
+		 * when the postmaster forked us. Leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigdelset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -756,6 +770,18 @@ StartBackgroundWorker(void)
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, as to
+		 * not cause any further issues from unblocking SIGUSR1 during
+		 * the execution of callbacks and other processing that will be
+		 * done during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigaddset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
-- 
2.25.1

#7vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#6)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

I'm not sure if this can occur in other cases.

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()).

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload
From b48849d091302dfdac4fc004195b3c532fdb9dcb Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Sat, 25 Jul 2020 06:38:40 +0530
Subject: [PATCH v3] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 src/backend/postmaster/bgworker.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..0b9f214 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -671,6 +671,23 @@ bgworker_sigusr1_handler(SIGNAL_ARGS)
 }
 
 /*
+ * update_parallel_worker_sigmask - add or remove a signal from sigmask.
+ */
+static void
+update_parallel_worker_sigmask(BackgroundWorker *worker, int signum,
+							   bool isadd)
+{
+	if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+	{
+		if (isadd)
+			sigaddset(&BlockSig, signum);
+		else
+			sigdelset(&BlockSig, signum);
+		PG_SETMASK(&BlockSig);
+	}
+}
+
+/*
  * Start a new background worker
  *
  * This is the main entry point for background worker, to be called from
@@ -747,6 +764,16 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock SIGUSR1 signal, it was blocked
+		 * when the postmaster forked us. Leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, false);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -757,6 +784,14 @@ StartBackgroundWorker(void)
 		EmitErrorReport();
 
 		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, as to
+		 * not cause any further issues from unblocking SIGUSR1 during
+		 * the execution of callbacks and other processing that will be
+		 * done during proc_exit().
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, true);
+
+		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
 		 * callback.  That will take care of releasing locks, etc.
-- 
1.8.3.1

#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#7)
Re: Parallel worker hangs while handling errors.

On Sat, Jul 25, 2020 at 7:02 AM vignesh C <vignesh21@gmail.com> wrote:

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Having a function to unblock selective signals for a bg worker looks good to me.

Few comments:
1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?
2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().
I slightly modified your function, something like below?

void
BackgroundWorkerUpdateSignalMask(int signum, bool toblock)
{
if (toblock)
sigaddset(&BlockSig, signum);
else
sigdelset(&BlockSig, signum);

PG_SETMASK(&BlockSig);
}

/*to unblock SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerUpdateSignalMask(SIGUSR1, false);

/*to block SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerUpdateSignalMask(SIGUSR1, true);

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#9vignesh C
vignesh21@gmail.com
In reply to: Bharath Rupireddy (#8)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

Thanks for your comments Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?

Fixed, I have moved the worker check to the caller function.

2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to break into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This maintains the consistency similar to
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patchDownload
From f06a5966d754d6622a3425c6cac1ad72314832ef Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 28 Jul 2020 10:48:17 +0530
Subject: [PATCH v4] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 doc/src/sgml/bgworker.sgml          |  6 +++++-
 src/backend/postmaster/bgworker.c   | 20 ++++++++++++++++++++
 src/backend/postmaster/postmaster.c | 17 +++++++++++++++++
 src/include/postmaster/bgworker.h   |  4 ++++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 6e1cf12..d900fc9 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -210,7 +210,11 @@ typedef struct BackgroundWorker
    allow the process to customize its signal handlers, if necessary.
    Signals can be unblocked in the new process by calling
    <function>BackgroundWorkerUnblockSignals</function> and blocked by calling
-   <function>BackgroundWorkerBlockSignals</function>.
+   <function>BackgroundWorkerBlockSignals</function>. A particular signal can
+   be unblocked in the new process by calling
+   <function>BackgroundWorkerRemoveBlockSignal(<parameter>int signum</parameter>)</function>
+   and blocked by calling
+   <function>BackgroundWorkerAddBlockSignal(<parameter>int signum</parameter>)</function>.
   </para>
 
   <para>
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..458b513 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,17 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock SIGUSR1 signal, it was blocked
+		 * when the postmaster forked us. Leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerRemoveBlockSignal(SIGUSR1);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -757,6 +768,15 @@ StartBackgroundWorker(void)
 		EmitErrorReport();
 
 		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, as to
+		 * not cause any further issues from unblocking SIGUSR1 during
+		 * the execution of callbacks and other processing that will be
+		 * done during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerAddBlockSignal(SIGUSR1);
+
+		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
 		 * callback.  That will take care of releasing locks, etc.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dec0258..751cd16 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5765,6 +5765,23 @@ BackgroundWorkerUnblockSignals(void)
 	PG_SETMASK(&UnBlockSig);
 }
 
+/*
+ * Block/unblock a particular signal in a background worker.
+ */
+void
+BackgroundWorkerAddBlockSignal(int signum)
+{
+	sigaddset(&BlockSig, signum);
+	PG_SETMASK(&BlockSig);
+}
+
+void
+BackgroundWorkerRemoveBlockSignal(int signum)
+{
+	sigdelset(&BlockSig, signum);
+	PG_SETMASK(&BlockSig);
+}
+
 #ifdef EXEC_BACKEND
 static pid_t
 bgworker_forkexec(int shmem_slot)
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
index 4c6ebaa..cb5423b 100644
--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -158,4 +158,8 @@ extern void BackgroundWorkerInitializeConnectionByOid(Oid dboid, Oid useroid, ui
 extern void BackgroundWorkerBlockSignals(void);
 extern void BackgroundWorkerUnblockSignals(void);
 
+/* Block/unblock a particular signal in a background worker process */
+extern void BackgroundWorkerAddBlockSignal(int signum);
+extern void BackgroundWorkerRemoveBlockSignal(int signum);
+
 #endif							/* BGWORKER_H */
-- 
1.8.3.1

#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: vignesh C (#9)
Re: Parallel worker hangs while handling errors.

On Tue, Jul 28, 2020 at 11:05 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for your comments Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?

Fixed, I have moved the worker check to the caller function.

2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to break into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This maintains the consistency similar to
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#11Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#10)
Re: Parallel worker hangs while handling errors.

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.
Two, SIGUSR1 is used for a LOT of things: look at all the different
cases procsignal_sigusr1_handler() checks. If the intention is to only
allow the things we know are safe, rather than all the signals there
are, I think this coding utterly fails to achieve that - and for
reasons that I don't think are really fixable.

My first idea about how to fix this was just to call
BackgroundWorkerUnblockSignals() before sigsetjmp(), but that doesn't
really work, because ParallelWorkerMain() needs to set the handler for
SIGTERM before unblocking signals. When you really look at it, the
code that does sigsetjmp() in StartBackgroundWorker() is entirely
bogus. The comment says "See notes in postgres.c about the design of
this coding," but if you go read that comment, it says that the point
of using sigsetjmp() is to make sure that signals are unblocked within
the if-block that follows, but the use in bgworker.c actually achieves
exactly the opposite, because signals have not yet been unblocked at
this point. So, whereas the postgres.c code unblocks signals if they
are blocked, this code blocks signals if they are unblocked. Given
that, maybe the right thing to do is to just start the if-block with a
call to BackgroundWorkerUnblockSignals(). Perhaps there's some reason
that would be unsafe, if the failure occurs too early: postgres.c
doesn't unblock signals until after BaseInit() and InitProcess() have
been called, but here an error in those functions would unblock
signals while it's being handled. Off-hand, I don't see why that would
matter, though. In the postgres.c case, there wouldn't be a
PG_exception_stack yet, so we'd end up in the long part of
pg_re_throw(), which basically promotes the ERROR to FATAL but
otherwise does pretty similar things to what this handler does anyway.
So I'm not really sure there's any reason not to just go this way.

Another approach would be to establish a new PG_exception_stack() with
a free sigsetjmp() call and fresh local buffer, inside
ParallelWorkerMain(). It would do the same thing as the existing
handler, but because it would be established after unblocking signals,
sigsetjmp() would behave as desired. This doesn't seem quite as good
to me because I think this pattern might end up getting copied into
many background workers, and it's a bunch of extra code for which I
don't really see a clear need. So at the moment I think a one line fix
in StartBackgroundWorker(), to just unblock signals while handling
errors, looks better.

Adding Alvaro to the CC line, since I think he wrote this code
originally. Not sure if he or anyone else might have an opinion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#11)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}, if postmaster sends a SIGQUIT while these processes are
doing clean up tasks in sigsetjmp(), it will not be received, and the
postmaster later sends SIGKLL to kill from below places.

/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
* normally, but under certain conditions backends can get stuck while
* shutting down. This is a last measure to get them unwedged.
*
* Note we also do this during recovery from a process crash.
*/
if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
AbortStartTime != 0 &&
(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
{
/* We were gentle with them before. Not anymore */
TerminateChildren(SIGKILL);
/* reset flag so we don't SIGKILL again */
AbortStartTime = 0;
}

Shouldn't we call PG_SETMASK(&BlockSig); to make it effective?

Am I missing anything here?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#13Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet. You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery. But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone. Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit. It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety. I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed). I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

regards, tom lane

Attachments:

fix-parallel-worker-hang-v5.patchtext/x-diff; charset=us-ascii; name=fix-parallel-worker-hang-v5.patchDownload
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..dd22241600 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -753,7 +753,14 @@ StartBackgroundWorker(void)
 		/* Prevent interrupts while cleaning up */
 		HOLD_INTERRUPTS();
 
-		/* Report the error to the server log */
+		/*
+		 * sigsetjmp will have blocked all signals, but we may need to accept
+		 * signals while communicating with our parallel leader.  Once we've
+		 * done HOLD_INTERRUPTS() it should be safe to unblock signals.
+		 */
+		BackgroundWorkerUnblockSignals();
+
+		/* Report the error to the parallel leader and the server log */
 		EmitErrorReport();
 
 		/*
#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other. It could be correct to interpret the
sigdelset call as preparatory work for a future call to
PG_SETMASK(&BlockSig), but AFAICS there are no such calls in the
processes where this incantation exists, so really it just seems to be
a no-op. Furthermore, the comment says that the point of the
sigdelset() is to allow SIGQUIT "at all times," which doesn't square
well with your suggestion that we intended it to take effect only
later.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet. You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery. But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone. Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit. It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

The SIGQUIT handler in question contains nothing than a call to
_exit(2) and a long comment explaining why we don't do anything else,
so I think the argument that it has no state dependencies is pretty
well water-tight. Whether or not we've got a problem with timely
SIGQUIT acceptance is much less clear. So it seems to me that the
safer thing to do here would be to unblock the signal. It might gain
something, and it can't really lose anything. Now it's true that the
calculus might change if someone were to modify the behavior of the
SIGQUIT handler in the future, but if they do then it's their job to
think about this stuff. It doesn't seem especially likely for that to
change, anyway. The only reason that the handler for regular backends
does anything other than _exit(2) is that we want to try to let the
client know what happened before we croak, and that concern is
irrelevant for background workers. Doing any other cleanup here is
unsafe and unnecessary.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety. I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed). I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

I am not sure whether the difference between this and v1 matters,
because in postgres.c it's effectively happening inside sigsetjmp, so
the earlier unblock must not be that bad. But I don't mind putting it
in the place you suggest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is. Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

regards, tom lane

#17Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

Really? Have you reversed your policy of wanting the comments to
accurately describe what the code does?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked. You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery. I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

regards, tom lane

#19Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 2:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked.

I mean, we're just repeating the same points here, but that's not what
the comment says.

You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not,

It's easily checked using 'git grep'.

but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I can't speak to the intent, but I can speak to what the comment says.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery.

I think it's entirely clear that they do not, and I have explained my
reasoning already.

I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

I've also already explained why I don't agree with this perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#18)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 11:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked. You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery. I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

I think the main point that needs to be thought is that: will any of
the bgwriter, checkpointer, walwriter and walreceiver processes need
to unblock SIGQUIT during their error recovery code paths i.e. in
their respective if (sigsetjmp(local_sigjmp_buf, 1) != 0){....}
stanzas? Currently, SIGQUIT is blocked in the sigsetjmp() stanza.

If the answer is yes: then we must do PG_SETMASK(&BlockSig); :either
right after sigdelset(&BlockSig, SIGQUIT); to allow quickdie() even
before the sigsetjmp() stanza and also in the sigsetjmp() stanza or do
PG_SETMASK(&BlockSig); only inside the sigsetjmp() stanza. The
postmaster sends SIGQUIT in immediate shutdown mode and it gives
children a chance to exit safely, but if the children take longer
time, then it anyways kills them with SIGKILL(note that SIGKILL can
not be handled or ignored by any process).

If the answer is no: let these processes perform clean ups in their
respective sigsetjmp() stanzas, until the postmaster sends SIGKILL if
the clean ups take time. We could have some elaborated comments before
sigdelset(&BlockSig, SIGQUIT); instead of "/* We allow SIGQUIT
(quickdie) at all times */" to avoid confusion.

We must not worry about blocking or unblocking SIGQUIT in these
processes after the sigsetjmp() stanza, as it anyways gets unblocked
by PG_SETMASK(&UnBlockSig); and also no problem if somebody does
PG_SETMASK(&BlockSig); in future as we have already done
sigdelset(&BlockSig, SIGQUIT);.

Can we start a separate thread to discuss this SIGQUIT point to not
sidetrack the main issue "Parallel worker hangs while handling
errors."?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#21vignesh C
vignesh21@gmail.com
In reply to: Robert Haas (#11)
Re: Parallel worker hangs while handling errors.

On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.
Two, SIGUSR1 is used for a LOT of things: look at all the different
cases procsignal_sigusr1_handler() checks. If the intention is to only
allow the things we know are safe, rather than all the signals there
are, I think this coding utterly fails to achieve that - and for
reasons that I don't think are really fixable.

My intention of blocking only SIGUSR1 over unblocking all signals
mainly because we are already in the error path and we are about to
exit after emitting the error report. I was not sure if we intended to
receive any other signal just before exiting.
The Solution Robert & Tom are suggesting by Calling
BackgroundWorkerUnblockSignals fixes the actual problem.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#21)
Re: Parallel worker hangs while handling errors.

vignesh C <vignesh21@gmail.com> writes:

The Solution Robert & Tom are suggesting by Calling
BackgroundWorkerUnblockSignals fixes the actual problem.

I've gone ahead and pushed the bgworker fix, since everyone seems
to agree that that's okay, and it is provably fixing a problem.

As for the question of SIGQUIT handling, I see that postgres.c
does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
change, so there probably isn't any harm in having the background
processes do likewise. I wonder though why bgworkers are not
applying the same policy. (I remain of the opinion that any
changes in this area should not be back-patched without evidence
of a concrete problem; it's at least as likely that we'll introduce
a problem as fix one.)

regards, tom lane

#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#22)
Re: Parallel worker hangs while handling errors.

On 2020-Sep-03, Tom Lane wrote:

As for the question of SIGQUIT handling, I see that postgres.c
does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
change, so there probably isn't any harm in having the background
processes do likewise. I wonder though why bgworkers are not
applying the same policy.

It's quite likely that it's the way it is more by accident than because
I was thinking extremely carefully about signal handling when originally
writing that code. Some parts of that code I was copying from others'
patches, and I could easily have missed a detail like this. (I didn't
"git blame" to verify that these parts are mine, though).

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

I wrote:

As for the question of SIGQUIT handling, I see that postgres.c
does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
change, so there probably isn't any harm in having the background
processes do likewise.

Concretely, something about like this (I just did the bgwriter, but
we'd want the same in all the background processes). I tried to
respond to Robert's complaint about the inaccurate comment just above
sigsetjmp, too.

This passes check-world, for what little that's worth.

regards, tom lane

Attachments:

clean-up-background-SIGQUIT-handling-wip.patchtext/x-diff; charset=us-ascii; name=clean-up-background-SIGQUIT-handling-wip.patchDownload
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..3ae7901bf6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -117,6 +117,7 @@ BackgroundWriterMain(void)
 
 	/* We allow SIGQUIT (quickdie) at all times */
 	sigdelset(&BlockSig, SIGQUIT);
+	PG_SETMASK(&BlockSig);
 
 	/*
 	 * We just started, assume there has been either a shutdown or
@@ -140,7 +141,19 @@ BackgroundWriterMain(void)
 	/*
 	 * If an exception is encountered, processing resumes here.
 	 *
-	 * See notes in postgres.c about the design of this coding.
+	 * You might wonder why this isn't coded as an infinite loop around a
+	 * PG_TRY construct.  The reason is that this is the bottom of the
+	 * exception stack, and so with PG_TRY there would be no exception handler
+	 * in force at all during the CATCH part.  By leaving the outermost setjmp
+	 * always active, we have at least some chance of recovering from an error
+	 * during error recovery.  (If we get into an infinite loop thereby, it
+	 * will soon be stopped by overflow of elog.c's internal state stack.)
+	 *
+	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
+	 * signals will be blocked until we complete error recovery.  It might
+	 * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
+	 * it is not since InterruptPending might be set already.
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
#25Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#24)
Re: Parallel worker hangs while handling errors.

On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Concretely, something about like this (I just did the bgwriter, but
we'd want the same in all the background processes). I tried to
respond to Robert's complaint about the inaccurate comment just above
sigsetjmp, too.

This passes check-world, for what little that's worth.

Seems totally reasonable from here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Concretely, something about like this (I just did the bgwriter, but
we'd want the same in all the background processes). I tried to
respond to Robert's complaint about the inaccurate comment just above
sigsetjmp, too.
This passes check-world, for what little that's worth.

Seems totally reasonable from here.

OK, I did the same in other relevant places and pushed it.

It's not clear to me whether we want to institute the "accepting SIGQUIT
is always okay" rule in processes that didn't already have code to change
BlockSig. The relevant processes are pgarch.c, startup.c, bgworker.c,
autovacuum.c (launcher and workers both), and walsender.c. In the first
two of these I doubt it matters, because I don't think they'll ever block
signals again anyway -- they certainly don't have outer sigsetjmp blocks.
And I'm a bit hesitant to mess with bgworker given that we seem to expect
that to be heavily used by extension code, and we're exposing code to
allow extensions to mess with the signal blocking state. On the other
hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's
hard to see a reason why holding it off could be necessary. So maybe
having a uniform rule would be good.

Any thoughts about that?

regards, tom lane

#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
Re: Parallel worker hangs while handling errors.

On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It's not clear to me whether we want to institute the "accepting SIGQUIT
is always okay" rule in processes that didn't already have code to change
BlockSig. The relevant processes are pgarch.c, startup.c, bgworker.c,
autovacuum.c (launcher and workers both), and walsender.c. In the first
two of these I doubt it matters, because I don't think they'll ever block
signals again anyway -- they certainly don't have outer sigsetjmp blocks.
And I'm a bit hesitant to mess with bgworker given that we seem to expect
that to be heavily used by extension code, and we're exposing code to
allow extensions to mess with the signal blocking state. On the other
hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's
hard to see a reason why holding it off could be necessary. So maybe
having a uniform rule would be good.

Any thoughts about that?

I think a backend process that isn't timely handling SIGQUIT is by
that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion.
So I favor trying to make it uniform.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
1 attachment(s)
Re: Parallel worker hangs while handling errors.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It's not clear to me whether we want to institute the "accepting SIGQUIT
is always okay" rule in processes that didn't already have code to change
BlockSig.

I think a backend process that isn't timely handling SIGQUIT is by
that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion.
So I favor trying to make it uniform.

Well, if we want to take a hard line about that, we should centralize
the setup of SIGQUIT. The attached makes InitPostmasterChild do it,
and removes the duplicative code from elsewhere.

I also flipped autovacuum and walsender from using quickdie to using
SignalHandlerForCrashExit. Whatever you think about the safety or
unsafety of quickdie, there seems no reason for autovacuum to be trying
to tell its nonexistent client about a shutdown. I don't think it's
terribly useful for a walsender either, though maybe somebody has a
different opinion about that?

regards, tom lane

Attachments:

centralize-sigquit-setup.patchtext/x-diff; charset=us-ascii; name=centralize-sigquit-setup.patchDownload
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 19ba26b914..2cef56f115 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -454,8 +454,8 @@ AutoVacLauncherMain(int argc, char *argv[])
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, StatementCancelHandler);
 	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 
-	pqsignal(SIGQUIT, quickdie);
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 
 	pqsignal(SIGPIPE, SIG_IGN);
@@ -498,9 +498,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 	 *
 	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
 	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
-	 * signals will be blocked until we complete error recovery.  It might
-	 * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
-	 * it is not since InterruptPending might be set already.
+	 * signals other than SIGQUIT will be blocked until we complete error
+	 * recovery.  It might seem that this policy makes the HOLD_INTERRUPTS()
+	 * call redundant, but it is not since InterruptPending might be set
+	 * already.
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
@@ -1531,7 +1532,8 @@ AutoVacWorkerMain(int argc, char *argv[])
 	 */
 	pqsignal(SIGINT, StatementCancelHandler);
 	pqsignal(SIGTERM, die);
-	pqsignal(SIGQUIT, quickdie);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
+
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 
 	pqsignal(SIGPIPE, SIG_IGN);
@@ -1562,9 +1564,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 	 *
 	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
 	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
-	 * signals will be blocked until we exit.  It might seem that this policy
-	 * makes the HOLD_INTERRUPTS() call redundant, but it is not since
-	 * InterruptPending might be set already.
+	 * signals other than SIGQUIT will be blocked until we exit.  It might
+	 * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
+	 * it is not since InterruptPending might be set already.
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d043ced686..5a9a0e3435 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -731,9 +731,9 @@ StartBackgroundWorker(void)
 		pqsignal(SIGFPE, SIG_IGN);
 	}
 	pqsignal(SIGTERM, bgworker_die);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGHUP, SIG_IGN);
 
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 
 	pqsignal(SIGPIPE, SIG_IGN);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c96568149f..a7afa758b6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -104,7 +104,7 @@ BackgroundWriterMain(void)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -115,10 +115,6 @@ BackgroundWriterMain(void)
 	 */
 	pqsignal(SIGCHLD, SIG_DFL);
 
-	/* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-	sigdelset(&BlockSig, SIGQUIT);
-	PG_SETMASK(&BlockSig);
-
 	/*
 	 * We just started, assume there has been either a shutdown or
 	 * end-of-recovery snapshot.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 45f5deca72..3e7dcd4f76 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -198,7 +198,7 @@ CheckpointerMain(void)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -209,10 +209,6 @@ CheckpointerMain(void)
 	 */
 	pqsignal(SIGCHLD, SIG_DFL);
 
-	/* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-	sigdelset(&BlockSig, SIGQUIT);
-	PG_SETMASK(&BlockSig);
-
 	/*
 	 * Initialize so that first time-driven event happens at the correct time.
 	 */
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..ed1b65358d 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -228,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, pgarch_waken);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3cd6fa30eb..959e3b8873 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4355,7 +4355,7 @@ BackendInitialize(Port *port)
 	 * cleaned up.
 	 */
 	pqsignal(SIGTERM, process_startup_packet_die);
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	PG_SETMASK(&StartupBlockSig);
 
@@ -4435,7 +4435,7 @@ BackendInitialize(Port *port)
 	status = ProcessStartupPacket(port, false, false);
 
 	/*
-	 * Disable the timeout, and prevent SIGTERM/SIGQUIT again.
+	 * Disable the timeout, and prevent SIGTERM again.
 	 */
 	disable_timeout(STARTUP_PACKET_TIMEOUT, false);
 	PG_SETMASK(&BlockSig);
@@ -4983,10 +4983,6 @@ SubPostmasterMain(int argc, char *argv[])
 	if (strcmp(argv[1], "--forkavworker") == 0)
 		AutovacuumWorkerIAm();
 
-	/* In EXEC_BACKEND case we will not have inherited these settings */
-	pqinitmask();
-	PG_SETMASK(&BlockSig);
-
 	/* Read in remaining GUC variables */
 	read_nondefault_variables();
 
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index fd9ac35dac..64af7b8707 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -175,7 +175,7 @@ StartupProcessMain(void)
 	pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
 	pqsignal(SIGINT, SIG_IGN);	/* ignore query cancel */
 	pqsignal(SIGTERM, StartupProcShutdownHandler);	/* request shutdown */
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 358c0916ac..a52832fe90 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -101,7 +101,7 @@ WalWriterMain(void)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, SignalHandlerForShutdownRequest);
 	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -112,10 +112,6 @@ WalWriterMain(void)
 	 */
 	pqsignal(SIGCHLD, SIG_DFL);
 
-	/* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-	sigdelset(&BlockSig, SIGQUIT);
-	PG_SETMASK(&BlockSig);
-
 	/*
 	 * Create a memory context that we will do all our work in.  We do this so
 	 * that we can reset the context during error recovery and thereby avoid
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b180598507..17f1a49f87 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -270,7 +270,7 @@ WalReceiverMain(void)
 	pqsignal(SIGHUP, WalRcvSigHupHandler);	/* set flag to read config file */
 	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, WalRcvShutdownHandler);	/* request shutdown */
-	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -279,10 +279,6 @@ WalReceiverMain(void)
 	/* Reset some signals that are accepted by postmaster but not here */
 	pqsignal(SIGCHLD, SIG_DFL);
 
-	/* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-	sigdelset(&BlockSig, SIGQUIT);
-	PG_SETMASK(&BlockSig);
-
 	/* Load the libpq-specific functions */
 	load_file("libpqwalreceiver", false);
 	if (WalReceiverFunctions == NULL)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f756b470a..5e6ae6fde5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3035,7 +3035,7 @@ WalSndSignals(void)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, StatementCancelHandler);	/* query cancel */
 	pqsignal(SIGTERM, die);		/* request shutdown */
-	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
+	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..411cfadbff 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3820,7 +3820,8 @@ PostgresMain(int argc, char *argv[],
 	}
 
 	/*
-	 * Set up signal handlers and masks.
+	 * Set up signal handlers.  (InitPostmasterChild or InitStandaloneProcess
+	 * has already set up BlockSig and made that the active signal mask.)
 	 *
 	 * Note that postmaster blocked all signals before forking child process,
 	 * so there is no race condition whereby we might receive a signal before
@@ -3842,6 +3843,9 @@ PostgresMain(int argc, char *argv[],
 		pqsignal(SIGTERM, die); /* cancel current query and exit */
 
 		/*
+		 * In a postmaster child backend, replace SignalHandlerForCrashExit
+		 * with quickdie, so we can tell the client we're dying.
+		 *
 		 * In a standalone backend, SIGQUIT can be generated from the keyboard
 		 * easily, while SIGTERM cannot, so we make both signals do die()
 		 * rather than quickdie().
@@ -3871,16 +3875,6 @@ PostgresMain(int argc, char *argv[],
 									 * platforms */
 	}
 
-	pqinitmask();
-
-	if (IsUnderPostmaster)
-	{
-		/* We allow SIGQUIT (quickdie) at all times */
-		sigdelset(&BlockSig, SIGQUIT);
-	}
-
-	PG_SETMASK(&BlockSig);		/* block everything except SIGQUIT */
-
 	if (!IsUnderPostmaster)
 	{
 		/*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cf8f9579c3..ed2ab4b5b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -32,10 +32,12 @@
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
+#include "libpq/pqsignal.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -133,6 +135,23 @@ InitPostmasterChild(void)
 		elog(FATAL, "setsid() failed: %m");
 #endif
 
+	/* In EXEC_BACKEND case we will not have inherited BlockSig etc values */
+#ifdef EXEC_BACKEND
+	pqinitmask();
+#endif
+
+	/*
+	 * Every postmaster child process is expected to respond promptly to
+	 * SIGQUIT at all times.  Therefore we centrally remove SIGQUIT from
+	 * BlockSig and install a suitable signal handler.  (Client-facing
+	 * processes may choose to replace this default choice of handler with
+	 * quickdie().)  All other blockable signals remain blocked for now.
+	 */
+	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+
+	sigdelset(&BlockSig, SIGQUIT);
+	PG_SETMASK(&BlockSig);
+
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
 }
@@ -155,6 +174,13 @@ InitStandaloneProcess(const char *argv0)
 	InitLatch(MyLatch);
 	InitializeLatchWaitSet();
 
+	/*
+	 * For consistency with InitPostmasterChild, initialize signal mask here.
+	 * But we don't unblock SIGQUIT or provide a default handler for it.
+	 */
+	pqinitmask();
+	PG_SETMASK(&BlockSig);
+
 	/* Compute paths, no postmaster to inherit from */
 	if (my_exec_path[0] == '\0')
 	{