Parallel query vs smart shutdown and Postmaster death

Started by Thomas Munro · almost 7 years ago · 6 messages
#1 Thomas Munro
thomas.munro@gmail.com

Hello hackers,

1. In a nearby thread, I misdiagnosed a problem reported[1] by Justin
Pryzby (though my misdiagnosis is probably still a thing to be fixed;
see next). I think I just spotted the real problem he saw: if you
execute a parallel query after a smart shutdown has been initiated,
you wait forever in gather_readnext()! Maybe parallel workers can't
be launched in this state, but we lack code to detect this case? I
haven't dug into the exact mechanism or figured out what to do about
it yet, and I'm tied up with something else for a bit, but I will come
back to this later if nobody beats me to it.
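
For the archives, the wait in question looks roughly like this (a
compressed sketch of the loop in gather_readnext(), not the verbatim
code; mqh here stands for the shm_mq_handle of one worker's tuple
queue):

    /* Leader side: wait for a tuple from a parallel worker. */
    for (;;)
    {
        shm_mq_result res;
        Size          nbytes;
        void         *data;

        /* Try to read a message without blocking. */
        res = shm_mq_receive(mqh, &nbytes, &data, true);
        if (res == SHM_MQ_SUCCESS)
            break;              /* got a message from the worker */
        if (res == SHM_MQ_DETACHED)
            break;              /* worker attached and then exited */

        /*
         * SHM_MQ_WOULD_BLOCK: sleep until our latch is set.  If the
         * postmaster never launches the worker and nobody tells us so,
         * nothing ever sets the latch and we sleep here forever.
         */
        WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_EXECUTE_GATHER);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }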

2. Commit cfdf4dc4 on the master branch fixed up all known waits that
didn't respond to postmaster death, and added an assertion to that
effect. One of the cases fixed was in gather_readnext(), and
initially I thought that's what Justin was telling us about (his
report was from 11.x), until I reread his message and saw that it was
SIGTERM and not e.g. SIGKILL. I should probably go and back-patch a fix
for that case anyway... but now I'm wondering, was there a reason for
that omission, and likewise for mq_putmessage()?
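
To be concrete: on master, a fixed wait site can now delegate
postmaster-death handling to the latch machinery, while a back-patched
fix would have to use the older WL_POSTMASTER_DEATH flag and test the
returned event mask itself. Roughly:

    /* Master, after cfdf4dc4: latch code handles PM death for us. */
    WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
              WAIT_EVENT_EXECUTE_GATHER);

    /* Back-branches: request the event and check for it explicitly. */
    int rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                       WAIT_EVENT_EXECUTE_GATHER);

    if (rc & WL_POSTMASTER_DEATH)
        proc_exit(1);           /* or whatever suits the call site */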

(Another case of missing PM death detection in the back-branches is
postgres_fdw.)

[1]: https://postgr.es/m/CAEepm=0kMunPC0hhuT0VC-5dfMT3K-xsToJHkTznA6yrSARsPg@mail.gmail.com

--
Thomas Munro
https://enterprisedb.com

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
1 attachment(s)
Re: Parallel query vs smart shutdown and Postmaster death

On Mon, Feb 25, 2019 at 2:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:

1. In a nearby thread, I misdiagnosed a problem reported[1] by Justin
Pryzby (though my misdiagnosis is probably still a thing to be fixed;
see next). I think I just spotted the real problem he saw: if you
execute a parallel query after a smart shutdown has been initiated,
you wait forever in gather_readnext()! Maybe parallel workers can't
be launched in this state, but we lack code to detect this case? I
haven't dug into the exact mechanism or figured out what to do about
it yet, and I'm tied up with something else for a bit, but I will come
back to this later if nobody beats me to it.

Given smart shutdown's stated goal, namely that it "lets existing
sessions end their work normally", my questions are:

1. Why does pmdie()'s SIGTERM case terminate parallel workers
immediately? That aborts running parallel queries, so they
don't get to end their work normally.
2. Why are new parallel workers not allowed to be started while in
this state? That hangs future parallel queries forever, so they don't
get to end their work normally.
3. Suppose we fix the above cases; should we do it for parallel
workers only (somehow), or for all bgworkers? It's hard to say since
I don't know what all bgworkers do.

In the meantime, perhaps we should teach the postmaster to report this
case as a failure to launch in back-branches, so that at least
parallel queries don't hang forever? Here's an initial sketch of a
patch like that: it gives you "ERROR: parallel worker failed to
initialize" and "HINT: More details may be available in the server
log." if you try to run a parallel query. The HINT is right, the
server logs say that a smart shutdown is in progress. If that seems a
bit hostile, consider that any parallel queries that were running at
the moment the smart shutdown was requested have already been ordered
to quit; why should new queries started after that get a better deal?
Then perhaps we could do some more involved surgery on master that
achieves smart shutdown's stated goal here, and lets parallel queries
actually run? Better ideas welcome.
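
For clarity, the error itself comes out of the existing launch-failure
detection in the leader: the postmaster's SIGUSR1 sets the leader's
latch, and the leader then discovers that the worker stopped without
ever attaching. Approximately (a sketch modeled on parallel.c, with
handle being the worker's BackgroundWorkerHandle):

    BgwHandleStatus status;
    pid_t           pid;

    /* Did this worker stop before attaching to its error queue? */
    status = GetBackgroundWorkerPid(handle, &pid);
    if (status == BGWH_STOPPED)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("parallel worker failed to initialize"),
                 errhint("More details may be available in the server log.")));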

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-Report-bgworker-launch-failure-during-smart-shutdown.patch (application/octet-stream)
From b816759e78c79600118abb68747003c8f6b915ea Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 27 Feb 2019 11:29:01 +1300
Subject: [PATCH] Report bgworker launch failure during smart shutdown.

If a smart shutdown has been initiated, bgworkers will never be
launched.  Tell any process that might be waiting for them, so that
(for example) parallel query doesn't wait forever to learn the
fate of the workers it tried to launch.

This means that after a smart shutdown has been initiated (and all
currently running parallel workers are forced to quit), future
parallel query executions will fail with:

  ERROR:  parallel worker failed to initialize

This isn't ideal, but it avoids waiting forever.  In future a more
graceful fix that allows parallel queries to run might be developed.

Author: Thomas Munro
Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/CA%2BhUKGLrJij0BuFtHsMHT4QnLP54Z3S6vGVBCWR8A49%2BNzctCw%40mail.gmail.com
---
 src/backend/postmaster/postmaster.c | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..c67139c7a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5713,6 +5713,28 @@ do_start_bgworker(RegisteredBgWorker *rw)
 	return false;
 }
 
+/*
+ * Are we in a state that allows bgworkers to be launched (eventually), or are
+ * we in the process of shutting down?
+ */
+static bool
+bgworker_can_be_started(void)
+{
+	switch (pmState)
+	{
+		case PM_NO_CHILDREN:
+		case PM_WAIT_DEAD_END:
+		case PM_SHUTDOWN_2:
+		case PM_SHUTDOWN:
+		case PM_WAIT_BACKENDS:
+		case PM_WAIT_READONLY:
+		case PM_WAIT_BACKUP:
+			return false;
+		default:
+			return true;
+	}
+}
+
 /*
  * Does the current postmaster state require starting a worker with the
  * specified start_time?
@@ -5852,6 +5874,26 @@ maybe_start_bgworkers(void)
 			continue;
 		}
 
+		/*
+		 * If we're in the process of shutting down, we can't create any new
+		 * bgworkers.  We'd better tell any process that is waiting for news
+		 * of the bgworker that it will never be launched.
+		 */
+		if (!bgworker_can_be_started())
+		{
+			int			notify_pid;
+
+			notify_pid = rw->rw_worker.bgw_notify_pid;
+
+			ForgetBackgroundWorker(&iter);
+
+			/* Report worker is gone now. */
+			if (notify_pid != 0)
+				kill(notify_pid, SIGUSR1);
+
+			continue;
+		}
+
 		/*
 		 * If this worker has crashed previously, maybe it needs to be
 		 * restarted (unless on registration it specified it doesn't want to
-- 
2.20.1

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#2)
Re: Parallel query vs smart shutdown and Postmaster death

On Tue, Feb 26, 2019 at 5:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Then perhaps we could do some more involved surgery on master that
achieves smart shutdown's stated goal here, and lets parallel queries
actually run? Better ideas welcome.

I have noticed before that the smart shutdown code does not disclose
to the rest of the system that a shutdown is in progress: no signals
are sent, and no shared memory state is updated. That makes it a bit
challenging for any other part of the system to respond to the smart
shutdown in a way that is, well, smart. But I guess that's not really
the problem in this case.

It seems to me that we could fix pmdie() to skip parallel workers; I
think that the postmaster could notice that they are flagged as
BGWORKER_CLASS_PARALLEL. But we'd also have to fix things so that new
parallel workers could be launched during a smart shutdown, which
looks not quite so simple.
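
Something like this in the signalling path is what I have in mind (a
sketch only, assuming we iterate over BackgroundWorkerList with
RegisteredBgWorker *rw the way the postmaster does elsewhere):

    /* SIGTERM everything except parallel workers, so that running
     * parallel queries can still complete during a smart shutdown. */
    if (rw->rw_pid > 0 &&
        (rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0)
        signal_child(rw->rw_pid, SIGTERM);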

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Arseny Sher
a.sher@postgrespro.ru
In reply to: Thomas Munro (#2)
1 attachment(s)
Re: Parallel query vs smart shutdown and Postmaster death

Hi,

Thomas Munro <thomas.munro@gmail.com> writes:

1. Why does pmdie()'s SIGTERM case terminate parallel workers
immediately? That aborts running parallel queries, so they
don't get to end their work normally.
2. Why are new parallel workers not allowed to be started while in
this state? That hangs future parallel queries forever, so they don't
get to end their work normally.
3. Suppose we fix the above cases; should we do it for parallel
workers only (somehow), or for all bgworkers? It's hard to say since
I don't know what all bgworkers do.

Attached patch fixes 1 and 2. As for 3, the only other internal
bgworkers currently are the logical replication launcher and workers,
which arguably should be killed immediately.

Here's an initial sketch of a
patch like that: it gives you "ERROR: parallel worker failed to
initialize" and "HINT: More details may be available in the server
log." if you try to run a parallel query.

Reporting bgworkers that the postmaster is never going to start is
probably worthwhile anyway, to prevent further deadlocks like this.
However, I think there is a problem in your patch: we might be in
post-PM_RUN states due to FatalError, not because of a shutdown. In
that case, we shouldn't refuse to run bgws in the future. I would also
merge the check into bgworker_should_start_now.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Allow-parallel-workers-while-backends-are-alive-in-s.patch (text/x-diff)
From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers after a shutdown has been issued, the parallel leader
gets stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 44 ++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d60969fdb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
+				slist_iter siter;
+
+				/*
+				 * Shut down all bgws except parallel workers: graceful
+				 * termination of existing sessions needs them.
+				 */
+				slist_foreach(siter, &BackgroundWorkerList)
+				{
+					RegisteredBgWorker *rw;
+
+					rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+					if (rw->rw_pid > 0 &&
+						((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0))
+					{
+						ereport(DEBUG4,
+								(errmsg_internal("sending signal %d to process %d",
+												 SIGTERM, (int) rw->rw_pid)));
+						signal_child(rw->rw_pid, SIGTERM);
+
+					}
+				}
+
 				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+	BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
 		case PM_SHUTDOWN_2:
 		case PM_SHUTDOWN:
+			return false;
+
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
-			break;
+			/*
+			 * Allow new parallel workers in SmartShutdown while backends
+			 * are still there.
+			 */
+			if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+				Shutdown != SmartShutdown ||
+				FatalError)
+				return false;
+			/* fall through */
 
 		case PM_RUN:
 			if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void)
 			}
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		if (bgworker_should_start_now(rw))
 		{
 			/* reset crash time before trying to start worker */
 			rw->rw_crashed_at = 0;
-- 
2.11.0

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Arseny Sher (#4)
Re: Parallel query vs smart shutdown and Postmaster death

On Sun, Mar 17, 2019 at 5:53 PM Arseny Sher <a.sher@postgrespro.ru> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

1. Why does pmdie()'s SIGTERM case terminate parallel workers
immediately? That aborts running parallel queries, so they
don't get to end their work normally.
2. Why are new parallel workers not allowed to be started while in
this state? That hangs future parallel queries forever, so they don't
get to end their work normally.
3. Suppose we fix the above cases; should we do it for parallel
workers only (somehow), or for all bgworkers? It's hard to say since
I don't know what all bgworkers do.

Attached patch fixes 1 and 2. As for the 3, the only other internal
bgworkers currently are logical replication launcher and workers, which
arguably should be killed immediately.

Hi Arseny,

Thanks for working on this! Yes, it seems better to move forwards
rather than backwards, and fix this properly as you have done in this
patch.

Just a thought: instead of the new hand-coded loop you added in
pmdie(), do you think it would make sense to have a new argument
"exclude_class_mask" for SignalSomeChildren()? If we did that, I
would consider renaming the existing parameter "target" to
"include_type_mask" to make it super clear what's going on.

Here's an initial sketch of a
patch like that: it gives you "ERROR: parallel worker failed to
initialize" and "HINT: More details may be available in the server
log." if you try to run a parallel query.

Reporting bgworkers that the postmaster is never going to start is
probably worthwhile anyway, to prevent further deadlocks like this.
However, I think there is a problem in your patch: we might be in
post-PM_RUN states due to FatalError, not because of a shutdown. In
that case, we shouldn't refuse to run bgws in the future. I would also
merge the check into bgworker_should_start_now.

Hmm, yeah, I haven't tested or double-checked in detail yet but I
think you're probably right about all of that.

--
Thomas Munro
https://enterprisedb.com

#6 Arseny Sher
a.sher@postgrespro.ru
In reply to: Thomas Munro (#5)
1 attachment(s)
Re: Parallel query vs smart shutdown and Postmaster death

Thomas Munro <thomas.munro@gmail.com> writes:

Just a thought: instead of the new hand-coded loop you added in
pmdie(), do you think it would make sense to have a new argument
"exclude_class_mask" for SignalSomeChildren()? If we did that, I
would consider renaming the existing parameter "target" to
"include_type_mask" to make it super clear what's going on.

I thought that this was a bit too complicated for a single use case, but
if you like it better, here is an updated version.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Allow-parallel-workers-while-backends-are-alive-i-v2.patch (text/x-diff)
From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers after a shutdown has been issued, the parallel leader
gets stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 72 ++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..2ad215b12d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(void);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
-static bool SignalSomeChildren(int signal, int targets);
+static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask);
 static void TerminateChildren(int signal);
 
-#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0)
 
 static int	CountChildren(int target);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
@@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
-				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+				/*
+				 * Shut down all bgws except parallel workers: graceful
+				 * termination of existing sessions needs them.
+				 * Autovac workers are also told to shut down immediately.
+				 */
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS)
 				signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
-				SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0);
 
 				/*
 				 * Only startup, bgwriter, walreceiver, possibly bgworkers,
@@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS)
 				/* shut down all backends and workers */
 				SignalSomeChildren(SIGTERM,
 								   BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
-								   BACKEND_TYPE_BGWORKER);
+								   BACKEND_TYPE_BGWORKER, 0);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal)
 
 /*
  * Send a signal to the targeted children (but NOT special children;
- * dead_end children are never signaled, either).
+ * dead_end children are never signaled, either). If BACKEND_TYPE_BGWORKER
+ * is in include_type_mask, bgworkers are additionally filtered out by
+ * exclude_bgw_mask.
  */
 static bool
-SignalSomeChildren(int signal, int target)
+SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask)
 {
 	dlist_iter	iter;
+	slist_iter	siter;
 	bool		signaled = false;
 
+	/*
+	 * Handle bgws by walking over BackgroundWorkerList because we have
+	 * to check the bgw class.
+	 */
+	if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0)
+	{
+		slist_foreach(siter, &BackgroundWorkerList)
+		{
+			RegisteredBgWorker *rw;
+
+			rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+			if (rw->rw_pid > 0 &&
+				((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0))
+			{
+				ereport(DEBUG4,
+						(errmsg_internal("sending signal %d to process %d",
+										 signal, (int) rw->rw_pid)));
+				signal_child(rw->rw_pid, signal);
+				signaled = true;
+			}
+		}
+	}
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend    *bp = dlist_container(Backend, elem, iter.cur);
 
-		if (bp->dead_end)
+		if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER)
 			continue;
 
 		/*
 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
 		 * it first and avoid touching shared memory for every child.
 		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (include_type_mask != BACKEND_TYPE_ALL)
 		{
 			/*
 			 * Assign bkend_type for any recently announced WAL Sender
@@ -3968,7 +3996,7 @@ SignalSomeChildren(int signal, int target)
 				IsPostmasterChildWalSender(bp->child_slot))
 				bp->bkend_type = BACKEND_TYPE_WALSND;
 
-			if (!(target & bp->bkend_type))
+			if (!(include_type_mask & bp->bkend_type))
 				continue;
 		}
 
@@ -5735,18 +5763,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+	BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
 		case PM_SHUTDOWN_2:
 		case PM_SHUTDOWN:
+			return false;
+
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
-			break;
+			/*
+			 * Allow new parallel workers in SmartShutdown while backends
+			 * are still there.
+			 */
+			if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+				Shutdown != SmartShutdown ||
+				FatalError)
+				return false;
+			/* fall through */
 
 		case PM_RUN:
 			if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5946,7 @@ maybe_start_bgworkers(void)
 			}
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		if (bgworker_should_start_now(rw))
 		{
 			/* reset crash time before trying to start worker */
 			rw->rw_crashed_at = 0;
-- 
2.11.0