common signal handler protection
In commit 97550c0, I added a "MyProcPid == getpid()" check in the SIGTERM
handler for the startup process. This ensures that processes forked by
system(3) (i.e., for restore_command) that have yet to install their own
signal handlers do not call proc_exit() upon receiving SIGTERM. Without
this protection, both the startup process and the restore_command process
might try to remove themselves from the PGPROC shared array (among other
things), which can end badly.
Since then, I've been exploring a more general approach that would offer
protection against similar issues in the future. We probably don't want
signal handlers in these grandchild processes to touch shared memory at
all. The attached 0001 is an attempt at adding such protection for all
handlers installed via pqsignal(). In short, it stores the actual handler
functions in a separate array, and sigaction() is given a wrapper function
that performs the "MyProcPid == getpid()" check. If that check fails, the
wrapper function installs the default signal handler and calls it.
Besides allowing us to revert commit 97550c0 (see attached 0003), this
wrapper handler could also restore errno, as shown in 0002. Right now,
individual signal handlers must do this today as needed, but that seems
easy to miss and prone to going unnoticed for a long time.
I see two main downsides of this proposal:
* Overhead: The wrapper handler calls a function pointer and getpid(),
which AFAICT is a real system call on most platforms. That might not be
a tremendous amount of overhead, but it's not zero, either. I'm
particularly worried about signal-heavy code like synchronous
replication. (Are there other areas that should be tested?) If this is
a concern, perhaps we could allow certain processes to opt out of this
wrapper handler, provided we believe it is unlikely to fork or that the
handler code is safe to run in grandchild processes.
* Race conditions: With these patches, pqsignal() becomes quite racy when
used within signal handlers. Specifically, you might get a bogus return
value. However, there are no in-tree callers of pqsignal() that look at
the return value (and I don't see any reason they should), and it seems
unlikely that pqsignal() will be used within signal handlers frequently,
so this might not be a deal-breaker. I did consider trying to convert
pqsignal() into a void function, but IIUC that would require an SONAME
bump. For now, I've just documented the bogosity of the return values.
Thoughts?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Check-that-MyProcPid-getpid-in-all-signal-handler.patchtext/x-diff; charset=us-asciiDownload+70-3
v1-0002-Centralize-logic-for-restoring-errno-in-signal-ha.patchtext/x-diff; charset=us-asciiDownload+6-80
v1-0003-Revert-Avoid-calling-proc_exit-in-processes-forke.patchtext/x-diff; charset=us-asciiDownload+1-63
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
+#ifdef NSIG +#define PG_NSIG (NSIG) +#else +#define PG_NSIG (64) /* XXX: wild guess */ +#endif
+ Assert(signo < PG_NSIG);
cfbot seems unhappy with this on Windows. IIUC we need to use
PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one
macro for all platforms.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 21, 2023 at 04:40:06PM -0600, Nathan Bossart wrote:
cfbot seems unhappy with this on Windows. IIUC we need to use
PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one
macro for all platforms.
Here's an attempt at fixing the Windows build.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Check-that-MyProcPid-getpid-in-all-signal-handler.patchtext/x-diff; charset=us-asciiDownload+72-3
v2-0002-Centralize-logic-for-restoring-errno-in-signal-ha.patchtext/x-diff; charset=us-asciiDownload+6-80
v2-0003-Revert-Avoid-calling-proc_exit-in-processes-forke.patchtext/x-diff; charset=us-asciiDownload+1-63
Hi,
On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote:
Subject: [PATCH v2 1/3] Check that MyProcPid == getpid() in all signal
handlers.In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process. This commit adds this check to
all signal handlers installed with pqsignal(). This is done by
using a wrapper function that performs the check before calling the
actual handler.The hope is that this will offer more general protection against
grandchildren processes inadvertently modifying shared memory due
to inherited signal handlers.
It's a bit unclear here what grandchildren refers to - it's presumably in
reference to postmaster, but the preceding text doesn't even mention
postmaster. I'd probably just say "child processes of the current process.
+ +#ifdef PG_SIGNAL_COUNT /* Windows */ +#define PG_NSIG (PG_SIGNAL_COUNT) +#elif defined(NSIG) +#define PG_NSIG (NSIG) +#else +#define PG_NSIG (64) /* XXX: wild guess */ +#endif
Perhaps worth adding a static assert for at least a few common types of
signals being below that value? That way we'd see a potential issue without
needing to reach the code path.
+/* + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function + * as the handler for all signals. This wrapper handler function checks that + * it is called within a process that the server knows about, and not a + * grandchild process forked by system(3), etc.
Similar comment to earlier - the grandchildren bit seems like a dangling
reference. And also too specific - I think we could encounter this in single
user mode as well?
Perhaps it could be reframed to "postgres processes, as determined by having
called InitProcessGlobals()"?
This check ensures that such + * grandchildren processes do not modify shared memory, which could be + * detrimental.
"could be" seems a bit euphemistic :)
From b77da9c54590a71eb9921d6f16bf4ffb0543e87e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 17 Nov 2023 14:00:12 -0600
Subject: [PATCH v2 2/3] Centralize logic for restoring errno in signal
handlers.Presently, we rely on each individual signal handler to save the
initial value of errno and then restore it before returning if
needed. This is easily forgotten and, if missed, often goes
undetected for a long time.
It's also just verbose :)
From 5734e0cf5f00bbd74504b45934f68e1c2c1290bd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 17 Nov 2023 22:09:24 -0600
Subject: [PATCH v2 3/3] Revert "Avoid calling proc_exit() in processes forked
by system()."Thanks to commit XXXXXXXXXX, this check in the SIGTERM handler for
the startup process is now obsolete and can be removed. Instead of
leaving around the elog(PANIC, ...) calls that are now unlikely to
be triggered and the dead function write_stderr_signal_safe(), I've
opted to just remove them for now. Thanks to modern version
control software, it will be trivial to dig those up if they are
ever needed in the future.This reverts commit 97550c0711972a9856b5db751539bbaf2f88884c.
Reviewed-by: ???
Discussion: ???
---
src/backend/postmaster/startup.c | 17 +----------------
src/backend/storage/ipc/ipc.c | 4 ----
src/backend/storage/lmgr/proc.c | 8 --------
src/backend/utils/error/elog.c | 28 ----------------------------
src/include/utils/elog.h | 6 ------
5 files changed, 1 insertion(+), 62 deletions(-)diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index 83dbed86b9..f40acd20ff 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -19,8 +19,6 @@ */ #include "postgres.h"-#include <unistd.h> - #include "access/xlog.h" #include "access/xlogrecovery.h" #include "access/xlogutils.h" @@ -113,20 +111,7 @@ static void StartupProcShutdownHandler(SIGNAL_ARGS) { if (in_restore_command) - { - /* - * If we are in a child process (e.g., forked by system() in - * RestoreArchivedFile()), we don't want to call any exit callbacks. - * The parent will take care of that. - */ - if (MyProcPid == (int) getpid()) - proc_exit(1); - else - { - write_stderr_signal_safe("StartupProcShutdownHandler() called in child process\n"); - _exit(1); - } - } + proc_exit(1); else shutdown_requested = true; WakeupRecovery();
Hm. I wonder if this indicates an issue. Do the preceding changes perhaps
make it more likely that a child process of the startup process could hang
around for longer, because the signal we're delivering doesn't terminate child
processes, because we'd just reset the signal handler? I think it's fine for
the startup process, because we ask the startup process to shut down with
SIGTERM, which defaults to exiting.
But we do have a few processes that we do ask to shut down with other
signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
walsender are asked to shut down using SIGUSR2 IIRC. The only one where that
could be an issue is archiver, I guess?
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c index 6591b5d6a8..1904d21795 100644 --- a/src/backend/storage/ipc/ipc.c +++ b/src/backend/storage/ipc/ipc.c @@ -103,10 +103,6 @@ static int on_proc_exit_index, void proc_exit(int code) { - /* not safe if forked by system(), etc. */ - if (MyProcPid != (int) getpid()) - elog(PANIC, "proc_exit() called in child process"); - /* Clean up everything that must be cleaned up */ proc_exit_prepare(code);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index e9e445bb21..5b663a2997 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -806,10 +806,6 @@ ProcKill(int code, Datum arg)Assert(MyProc != NULL);
- /* not safe if forked by system(), etc. */
- if (MyProc->pid != (int) getpid())
- elog(PANIC, "ProcKill() called in child process");
-
/* Make sure we're out of the sync rep lists */
SyncRepCleanupAtProcExit();@@ -930,10 +926,6 @@ AuxiliaryProcKill(int code, Datum arg)
Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
- /* not safe if forked by system(), etc. */
- if (MyProc->pid != (int) getpid())
- elog(PANIC, "AuxiliaryProcKill() called in child process");
-
auxproc = &AuxiliaryProcs[proctype];Assert(MyProc == auxproc);
I think we should leave these checks. It's awful to debug situations where a
proc gets reused and the cost of the checks is irrelevant. The checks don't
just protect against child processes, they also protect e.g. against calling
those functions twice (we IIRC had cases of that).
Greetings,
Andres Freund
Thanks for taking a look.
On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote:
On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote:
+/* + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function + * as the handler for all signals. This wrapper handler function checks that + * it is called within a process that the server knows about, and not a + * grandchild process forked by system(3), etc.Similar comment to earlier - the grandchildren bit seems like a dangling
reference. And also too specific - I think we could encounter this in single
user mode as well?Perhaps it could be reframed to "postgres processes, as determined by having
called InitProcessGlobals()"?
Eh, apparently my attempt at being clever didn't pan out. I like your idea
of specifying InitProcessGlobals(), but I might also include "e.g., client
backends", too.
Hm. I wonder if this indicates an issue. Do the preceding changes perhaps
make it more likely that a child process of the startup process could hang
around for longer, because the signal we're delivering doesn't terminate child
processes, because we'd just reset the signal handler? I think it's fine for
the startup process, because we ask the startup process to shut down with
SIGTERM, which defaults to exiting.But we do have a few processes that we do ask to shut down with other
signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
walsender are asked to shut down using SIGUSR2 IIRC. The only one where that
could be an issue is archiver, I guess?
This did cross my mind. AFAICT most default handlers already trigger an
exit (including SIGUSR2), and for the ones that don't, we'd just end up in
the same situation as if the signal arrived a moment later after the child
process has installed its own handlers. And postmaster doesn't send
certain signals (e.g., SIGHUP) to the whole process group, so we don't have
the opposite problem where things like reloading configuration files
terminates these child processes.
So, I didn't notice any potential issues. Did you have anything else in
mind?
- /* not safe if forked by system(), etc. */
- if (MyProc->pid != (int) getpid())
- elog(PANIC, "AuxiliaryProcKill() called in child process");
-
auxproc = &AuxiliaryProcs[proctype];Assert(MyProc == auxproc);
I think we should leave these checks. It's awful to debug situations where a
proc gets reused and the cost of the checks is irrelevant. The checks don't
just protect against child processes, they also protect e.g. against calling
those functions twice (we IIRC had cases of that).
Sure, that's no problem.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Here is a new patch set with feedback addressed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Check-that-MyProcPid-getpid-in-all-signal-handler.patchtext/x-diff; charset=us-asciiDownload+78-3
v3-0002-Centralize-logic-for-restoring-errno-in-signal-ha.patchtext/x-diff; charset=us-asciiDownload+6-80
v3-0003-Revert-Avoid-calling-proc_exit-in-processes-forke.patchtext/x-diff; charset=us-asciiDownload+1-51
Hi,
On 2023-11-28 15:39:55 -0600, Nathan Bossart wrote:
From e4bea5353c2685457545b67396095e9b96156982 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 28 Nov 2023 14:58:20 -0600
Subject: [PATCH v3 1/3] Check that MyProcPid == getpid() in all signal
handlers.In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process. This commit adds this check to
all signal handlers installed with pqsignal(). This is done by
using a wrapper function that performs the check before calling the
actual handler.The hope is that this will offer more general protection against
child processes of Postgres backends inadvertently modifying shared
memory due to inherited signal handlers. Another potential
follow-up improvement is to use this wrapper handler function to
restore errno instead of relying on each individual handler
function to do so.This commit makes the changes in commit 97550c0711 obsolete but
leaves reverting it for a follow-up commit.
For a moment I was, wrongly, worried this would break signal handlers we
intentionally inherit from postmaster. It's fine though, because we block
signals in fork_process() until somewhere in InitPostmasterChild(), after
we've called InitProcessGlobals(). But perhaps that should be commented upon
somewhere?
Greetings,
Andres Freund
Hi,
On 2023-11-27 16:16:25 -0600, Nathan Bossart wrote:
On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote:
Hm. I wonder if this indicates an issue. Do the preceding changes perhaps
make it more likely that a child process of the startup process could hang
around for longer, because the signal we're delivering doesn't terminate child
processes, because we'd just reset the signal handler? I think it's fine for
the startup process, because we ask the startup process to shut down with
SIGTERM, which defaults to exiting.But we do have a few processes that we do ask to shut down with other
signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
walsender are asked to shut down using SIGUSR2 IIRC. The only one where that
could be an issue is archiver, I guess?This did cross my mind. AFAICT most default handlers already trigger an
exit (including SIGUSR2), and for the ones that don't, we'd just end up in
the same situation as if the signal arrived a moment later after the child
process has installed its own handlers. And postmaster doesn't send
certain signals (e.g., SIGHUP) to the whole process group, so we don't have
the opposite problem where things like reloading configuration files
terminates these child processes.So, I didn't notice any potential issues. Did you have anything else in
mind?
No, I just was wondering about issues in this area. I couldn't immediately
think of a concrete scenario, so I thought I'd mention it here.
Greetings,
Andres Freund
On Tue, Nov 28, 2023 at 06:37:50PM -0800, Andres Freund wrote:
For a moment I was, wrongly, worried this would break signal handlers we
intentionally inherit from postmaster. It's fine though, because we block
signals in fork_process() until somewhere in InitPostmasterChild(), after
we've called InitProcessGlobals(). But perhaps that should be commented upon
somewhere?
Good call. I expanded on the MyProcPid assertion in wrapper_handler() a
bit.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Check-that-MyProcPid-getpid-in-all-signal-handler.patchtext/x-diff; charset=us-asciiDownload+84-3
v4-0002-Centralize-logic-for-restoring-errno-in-signal-ha.patchtext/x-diff; charset=us-asciiDownload+6-80
v4-0003-Revert-Avoid-calling-proc_exit-in-processes-forke.patchtext/x-diff; charset=us-asciiDownload+1-51
rebased for cfbot
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Check-that-MyProcPid-getpid-in-all-signal-handler.patchtext/x-diff; charset=us-asciiDownload+84-3
v5-0002-Centralize-logic-for-restoring-errno-in-signal-ha.patchtext/x-diff; charset=us-asciiDownload+6-80
v5-0003-Revert-Avoid-calling-proc_exit-in-processes-forke.patchtext/x-diff; charset=us-asciiDownload+1-51
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
* Overhead: The wrapper handler calls a function pointer and getpid(),
which AFAICT is a real system call on most platforms. That might not be
a tremendous amount of overhead, but it's not zero, either. I'm
particularly worried about signal-heavy code like synchronous
replication. (Are there other areas that should be tested?) If this is
a concern, perhaps we could allow certain processes to opt out of this
wrapper handler, provided we believe it is unlikely to fork or that the
handler code is safe to run in grandchild processes.
I finally spent some time trying to measure this overhead. Specifically, I
sent many, many SIGUSR2 signals to postmaster, which just uses
dummy_handler(), i.e., does nothing. I was just barely able to get
wrapper_handler() to show up in the first page of 'perf top' in this
extreme case, which leads me to think that the overhead might not be a
problem.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote:
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
* Overhead: The wrapper handler calls a function pointer and getpid(),
which AFAICT is a real system call on most platforms. That might not be
a tremendous amount of overhead, but it's not zero, either. I'm
particularly worried about signal-heavy code like synchronous
replication. (Are there other areas that should be tested?) If this is
a concern, perhaps we could allow certain processes to opt out of this
wrapper handler, provided we believe it is unlikely to fork or that the
handler code is safe to run in grandchild processes.I finally spent some time trying to measure this overhead. Specifically, I
sent many, many SIGUSR2 signals to postmaster, which just uses
dummy_handler(), i.e., does nothing. I was just barely able to get
wrapper_handler() to show up in the first page of 'perf top' in this
extreme case, which leads me to think that the overhead might not be a
problem.
That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one
of the cheapest system calls (IIRC only beat by close() of an invalid FD on
recent-ish linux). If it were to become an issue, we'd much better spend our
time reducing the millions of signals/sec that'd have to involve.
Greetings,
Andres Freund
On Tue, Feb 06, 2024 at 06:48:53PM -0800, Andres Freund wrote:
On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote:
I finally spent some time trying to measure this overhead. Specifically, I
sent many, many SIGUSR2 signals to postmaster, which just uses
dummy_handler(), i.e., does nothing. I was just barely able to get
wrapper_handler() to show up in the first page of 'perf top' in this
extreme case, which leads me to think that the overhead might not be a
problem.That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one
of the cheapest system calls (IIRC only beat by close() of an invalid FD on
recent-ish linux). If it were to become an issue, we'd much better spend our
time reducing the millions of signals/sec that'd have to involve.
Indeed.
I'd like to get this committed (to HEAD only) in the next few weeks. TBH
I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
is called within a signal handler), but I also think it is unlikely that
they cause any issues in practice. Please do let me know if you have any
concerns about this.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Sorry for the noise.
On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote:
I'd like to get this committed (to HEAD only) in the next few weeks. TBH
I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
is called within a signal handler), but I also think it is unlikely that
they cause any issues in practice. Please do let me know if you have any
concerns about this.
Perhaps we should add a file global bool that is only set during
wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if
pqsignal() is called with it set.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote:
On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote:
I'd like to get this committed (to HEAD only) in the next few weeks. TBH
I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
is called within a signal handler), but I also think it is unlikely that
they cause any issues in practice. Please do let me know if you have any
concerns about this.
I don't.
Perhaps we should add a file global bool that is only set during
wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if
pqsignal() is called with it set.
In older branches that might have been harder (due to forking from a signal
handler and non-fatal errors thrown from signal handlers), but these days I
think that should work.
FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of
a signal handler :)
If it were just for the purpose of avoiding the issue you brought up it might
not quite be worth it - but there are a lot of things we want to forbid in a
signal handler. Memory allocations, acquiring locks, throwing non-panic
errors, etc. That's one of the main reasons I like a common wrapper signal
handler.
Which reminded me of /messages/by-id/87msstvper.fsf@163.com - the set of
things we want to forbid are similar. I'm not sure there's really room to
harmonize things, but I thought I'd raise it.
Perhaps we should make the state a bitmap and have a single
AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER)
Greetings,
Andres Freund
On Wed, Feb 07, 2024 at 10:40:50AM -0800, Andres Freund wrote:
On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote:
Perhaps we should add a file global bool that is only set during
wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if
pqsignal() is called with it set.In older branches that might have been harder (due to forking from a signal
handler and non-fatal errors thrown from signal handlers), but these days I
think that should work.FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of
a signal handler :)
*facepalm* Yes.
If it were just for the purpose of avoiding the issue you brought up it might
not quite be worth it - but there are a lot of things we want to forbid in a
signal handler. Memory allocations, acquiring locks, throwing non-panic
errors, etc. That's one of the main reasons I like a common wrapper signal
handler.Which reminded me of /messages/by-id/87msstvper.fsf@163.com - the set of
things we want to forbid are similar. I'm not sure there's really room to
harmonize things, but I thought I'd raise it.Perhaps we should make the state a bitmap and have a single
AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER)
Seems worth a try. I'll go ahead and proceed with these patches and leave
this improvement for another thread.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote:
rebased for cfbot
I took a look over each of these. +1 for all. Thank you.
On Wed, Feb 14, 2024 at 11:55:43AM -0800, Noah Misch wrote:
I took a look over each of these. +1 for all. Thank you.
Committed. Thanks!
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com