Add guc to enable send SIGSTOP to peers when backend exits abnormally

Started by 蔡梦娟(玊于)over 4 years ago4 messages

mengjuan.cmj@alibaba-inc.com

over 4 years ago

1 attachment(s)

Hi all

I want to share a patch with you, in which I add a guc parameter 'enable_send_stop' to enable set the value of SendStop in postmaster.c more conveniently. SendStop enable postmaster to send SIGSTOP rather than SIGQUIT to its children when some backend dumps core, and this variable is originally set with -T parameter when start postgres, which is inconvenient to control in some scenarios.

Thanks & Best Regards

Attachments:

0001-Add-guc-to-set-SendStop.patchapplication/octet-streamDownload

From b26a2631e66dccdb822ecbcfef88b940e97898ec Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Fri, 3 Sep 2021 12:11:50 +0800
Subject: [PATCH] Add guc parameter to enable send SIGSTOP to peers when
 backend exits abnormally

---
 src/backend/postmaster/postmaster.c | 25 ++++++++++++++++++++++---
 src/backend/utils/misc/guc.c        | 18 ++++++++++++++++++
 src/include/postmaster/postmaster.h |  1 +
 src/include/utils/guc.h             |  2 ++
 4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9c2c98614a..183c16606f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -223,7 +223,7 @@ static pgsocket ListenSocket[MAXLISTEN];
 /*
  * These globals control the behavior of the postmaster in case some
  * backend dumps core.  Normally, it kills all peers of the dead backend
- * and reinitializes shared memory.  By specifying -s or -n, we can have
+ * and reinitializes shared memory.  By specifying -T or -n, we can have
  * the postmaster stop (rather than kill) peers and not reinitialize
  * shared data structures.  (Reinit is currently dead code, though.)
  */
@@ -816,6 +816,12 @@ PostmasterMain(int argc, char *argv[])
 				 * post_hacker collect core dumps from everyone.
 				 */
 				SendStop = true;
+				/* 
+				 * set guc parameter enable_send_stop at the same time when specified -T on command line.
+				 * avoid the value of SendStop being reset by enable_send_stop in postgresql.conf
+				 * the priority of setting guc parameter by command line is higher than that by postgresql.conf
+				 */
+				SetConfigOption("enable_send_stop", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
 			case 't':
@@ -3501,7 +3507,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 *
 			 * SIGQUIT is the special signal that says exit without proc_exit
 			 * and let the user know what's going on. But if SendStop is set
-			 * (-T on command line), then we send SIGSTOP instead, so that we
+			 * (-T on command line or set guc parameter enable_send_stop), 
+			 * then we send SIGSTOP instead, so that we
 			 * can get core dumps from all backends by hand.
 			 */
 			if (take_action)
@@ -3544,7 +3551,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 *
 			 * SIGQUIT is the special signal that says exit without proc_exit
 			 * and let the user know what's going on. But if SendStop is set
-			 * (-T on command line), then we send SIGSTOP instead, so that we
+			 * (-T on command line or set guc parameter enable_send_stop), 
+			 * then we send SIGSTOP instead, so that we
 			 * can get core dumps from all backends by hand.
 			 *
 			 * We could exclude dead_end children here, but at least in the
@@ -6616,3 +6624,14 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+
+/*
+ * set value of SendStop according to guc parameter enable_send_stop
+ * so that it can be set not only by specifying -T when start postgres
+ */
+void
+assign_enable_send_stop(bool newval, void *extra)
+{
+	SendStop = newval;
+}
\ No newline at end of file
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 29851119f1..a6ca8511b4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -584,6 +584,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		enable_send_stop = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -2116,6 +2117,23 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * In the event that some backend dumps core, send SIGSTOP rather than SIGQUIT
+	 * to all its peers, this lets the wily post_hacker collect core dumps from everyone.
+	 * in original way, this is set with -T parameter when start postgres
+	 * changes to set via guc parameter enable_send_stop
+	 */
+	{
+		{"enable_send_stop", PGC_SIGHUP, UNGROUPED,
+			gettext_noop("Enable to send SIGSTOP to all peers when some backend exit abnormally"),
+			NULL,
+			GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL
+		},
+		&enable_send_stop,
+		false,
+		NULL, assign_enable_send_stop, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 0efdd7c232..77056f7a6e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -54,6 +54,7 @@ extern void InitProcessGlobals(void);
 extern int	MaxLivePostmasterChildren(void);
 
 extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_enable_send_stop(bool newval, void *extra);
 
 #ifdef EXEC_BACKEND
 extern pid_t postmaster_forkexec(int argc, char *argv[]);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index a7c3a4958e..feec65cb16 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -282,6 +282,8 @@ extern int	tcp_user_timeout;
 extern bool trace_sort;
 #endif
 
+extern bool enable_send_stop;
+
 /*
  * Functions exported by guc.c
  */
-- 
2.24.3 (Apple Git-128)

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: 蔡梦娟(玊于) (#1)

Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally

"=?UTF-8?B?6JSh5qKm5aifKOeOiuS6jik=?=" <mengjuan.cmj@alibaba-inc.com> writes:

I want to share a patch with you, in which I add a guc parameter 'enable_send_stop' to enable set the value of SendStop in postmaster.c more conveniently. SendStop enable postmaster to send SIGSTOP rather than SIGQUIT to its children when some backend dumps core, and this variable is originally set with -T parameter when start postgres, which is inconvenient to control in some scenarios.

TBH, I'd sooner rip out SendStop, and simplify the related postmaster
logic. I've never used it in twenty-some years of Postgres hacking,
and I doubt anyone else has used it much either. It's not worth the
overhead of a GUC. (The argument that you need it in situations
where you can't control the postmaster's command line seems pretty
thin, too. I'm much more worried about somebody turning it on by
accident and then complaining that the cluster freezes upon crash.)

regards, tom lane

Alvaro Herrera

alvherre@alvh.no-ip.org

over 4 years ago

In reply to: Tom Lane (#2)

1 attachment(s)

Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally

On 2021-Sep-03, Tom Lane wrote:

"=?UTF-8?B?6JSh5qKm5aifKOeOiuS6jik=?=" <mengjuan.cmj@alibaba-inc.com> writes:

I want to share a patch with you, in which I add a guc parameter 'enable_send_stop' to enable set the value of SendStop in postmaster.c more conveniently. SendStop enable postmaster to send SIGSTOP rather than SIGQUIT to its children when some backend dumps core, and this variable is originally set with -T parameter when start postgres, which is inconvenient to control in some scenarios.

TBH, I'd sooner rip out SendStop, and simplify the related postmaster
logic.

I wrote a patch to do that in 2012, after this exchange:
/messages/by-id/1333124720-sup-6193@alvh.no-ip.org
I obviously doesn't apply at all anymore, but the thing that prevented
me from sending it was I couldn't find what the mentioned feature was
that would cause all backends to dump core at the time of a crash.
So it seemed to me that we would be ripping out a feature I had used,
with no replacement.

(It applies cleanly on top of 36b7e3da17bc.)

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Cuando no hay humildad las personas se degradan" (A. Christie)

Attachments:

0001-Remove-T-postmaster-option.patchtext/x-diff; charset=utf-8Download

From d8198f6ad6c814c03c486bcce696fda912393405 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 23 May 2012 22:18:21 -0400
Subject: [PATCH] Remove -T postmaster option

This used to tell postmaster to send SIGSTOP instead of SIGQUIT in case
of trouble, with the intent of letting a hypothetical PG hacker collect
core dumps from every backends by attaching a debugger to them, one by
one.  This has long been superseded by the ability of current operating
systems to dump core of several processes simultaneously.
---
 src/backend/postmaster/postmaster.c | 52 +++++++++--------------------
 1 file changed, 16 insertions(+), 36 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eeea933b19..bf6166caf4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -187,7 +187,6 @@ static char ExtraOptions[MAXPGPATH];
  * shared data structures.	(Reinit is currently dead code, though.)
  */
 static bool Reinit = true;
-static int	SendStop = false;
 
 /* still more option variables */
 bool		EnableSSL = false;
@@ -544,7 +543,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "A:B:bc:C:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "A:B:bc:C:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:st:W:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -654,16 +653,6 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("log_statement_stats", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
-			case 'T':
-
-				/*
-				 * In the event that some backend dumps core, send SIGSTOP,
-				 * rather than SIGQUIT, to all its peers.  This lets the wily
-				 * post_hacker collect core dumps from everyone.
-				 */
-				SendStop = true;
-				break;
-
 			case 't':
 				{
 					const char *tmp = get_stats_option_name(optarg);
@@ -2704,9 +2693,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 * to commit hara-kiri.
 			 *
 			 * SIGQUIT is the special signal that says exit without proc_exit
-			 * and let the user know what's going on. But if SendStop is set
-			 * (-s on command line), then we send SIGSTOP instead, so that we
-			 * can get core dumps from all backends by hand.
+			 * and let the user know what's going on.
 			 *
 			 * We could exclude dead_end children here, but at least in the
 			 * SIGSTOP case it seems better to include them.
@@ -2715,9 +2702,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			{
 				ereport(DEBUG2,
 						(errmsg_internal("sending %s to process %d",
-										 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-										 (int) bp->pid)));
-				signal_child(bp->pid, (SendStop ? SIGSTOP : SIGQUIT));
+										 "SIGQUIT", (int) bp->pid)));
+				signal_child(bp->pid, SIGQUIT);
 			}
 		}
 	}
@@ -2729,9 +2715,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) StartupPID)));
-		signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) StartupPID)));
+		signal_child(StartupPID, SIGQUIT);
 	}
 
 	/* Take care of the bgwriter too */
@@ -2741,9 +2726,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) BgWriterPID)));
-		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) BgWriterPID)));
+		signal_child(BgWriterPID, SIGQUIT);
 	}
 
 	/* Take care of the checkpointer too */
@@ -2753,9 +2737,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) CheckpointerPID)));
-		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) CheckpointerPID)));
+		signal_child(CheckpointerPID, SIGQUIT);
 	}
 
 	/* Take care of the walwriter too */
@@ -2765,9 +2748,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) WalWriterPID)));
-		signal_child(WalWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) WalWriterPID)));
+		signal_child(WalWriterPID, SIGQUIT);
 	}
 
 	/* Take care of the walreceiver too */
@@ -2777,9 +2759,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) WalReceiverPID)));
-		signal_child(WalReceiverPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) WalReceiverPID)));
+		signal_child(WalReceiverPID, SIGQUIT);
 	}
 
 	/* Take care of the autovacuum launcher too */
@@ -2789,9 +2770,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	{
 		ereport(DEBUG2,
 				(errmsg_internal("sending %s to process %d",
-								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
-								 (int) AutoVacPID)));
-		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
+								 "SIGQUIT", (int) AutoVacPID)));
+		signal_child(AutoVacPID, SIGQUIT);
 	}
 
 	/*
-- 
2.30.2

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Alvaro Herrera (#3)

Re: Add guc to enable send SIGSTOP to peers when backend exits abnormally

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2021-Sep-03, Tom Lane wrote:

TBH, I'd sooner rip out SendStop, and simplify the related postmaster
logic.

I wrote a patch to do that in 2012, after this exchange:
/messages/by-id/1333124720-sup-6193@alvh.no-ip.org
I obviously doesn't apply at all anymore, but the thing that prevented
me from sending it was I couldn't find what the mentioned feature was
that would cause all backends to dump core at the time of a crash.

Oh, I think you misunderstood what I wrote. I was thinking of the
ancient habit of most kernels to dump cores to a file just named
"core"; so that even if you went around and manually SIGABRT'd
each stopped process, the cores would all overwrite each other,
leaving you with little to show for the exercise. Nowadays you're
more likely to get "core.NNN" for each PID, so that it could in
principle be useful to force all the backends to dump core for later
analysis. But I know of no mechanism that would do that for you.

However, thinking about this afresh, it seems like that Berkeley-era
comment about "the wily post_hacker" was never very apropos. If what
you wanted was a few GB of core files for later analysis, it'd make
more sense to have the postmaster send SIGABRT or the like. That
saves a bunch of tedious manual steps, plus the cluster isn't left
in a funny state that requires yet more manual cleanup steps.

So I'm thinking that the *real* use-case for this is for developers
to attach with gdb and do on-the-fly investigation of the state of
other backends, rather than forcing core-dumps. However, it's still
a pretty half-baked feature because there's no easy way to clean up
afterwards.

The other elephant in the room is that by the time the postmaster
has reacted to the initial backend crash, it's dubious whether the
state of other processes is still able to tell you much. (IME,
at least, the postmaster doesn't hear about it until the kernel
has finished writing out the dying process's core image, which
takes approximately forever compared to modern CPU speeds.)

So it seemed to me that we would be ripping out a feature I had used,
with no replacement.

If we had a really useful feature here I'd be all over it.
But it looks more like somebody's ten-minute hack, so the
fact that it's undocumented and obscure-to-invoke seems
appropriate to me.

(BTW, I think we had exactly this discussion way back when
Peter cleaned up the postmaster/postgres command line switches.
Just about all the other old switches have equivalent GUCs,
and IIRC it is not an oversight that SendStop was left out.)

regards, tom lane