[PATCH] Optimize ProcSignal to avoid redundant SIGUSR1 signals
Hi hackers,
While working on optimizing async.c, I came across some surprisingly
low-hanging fruit in procsignal.c: $subject.
This optimization improves not only LISTEN/NOTIFY
but procsignal.c in general, for all ProcSignalReasons,
by avoiding redundant signals in the case where the backend
hasn't yet received and handled the ones already pending.
--- PATCH ---
Previously, ProcSignal used an array of volatile sig_atomic_t flags, one
per signal reason. A sender would set a flag and then unconditionally
send a SIGUSR1 to the target process. This could result in a storm of
redundant signals if multiple processes signaled the same target before
it had a chance to run its signal handler.
Change this to use a single pg_atomic_uint32 as a bitmask of pending
signals. When sending, use pg_atomic_fetch_or_u32 to set the appropriate
signal bit and inspect the prior state of the flags word. Then only
issue a SIGUSR1 if the previous flags state was zero. This works safely
because the receiving backend's signal handler atomically resets the
entire bitmask upon receipt, thus processing all pending signals at
once. Consequently, subsequent senders seeing a nonzero prior state know
a signal is already in flight, significantly reducing redundant
kill(pid, SIGUSR1) system calls under heavy contention.
On the receiving end, the SIGUSR1 handler now atomically fetches and
clears the entire bitmask with a single pg_atomic_exchange_u32, then
calls the appropriate sub-handlers.
The further optimization of only signaling when the old flags word was
zero is due to Andreas Karlsson.
--- BENCHMARK ---
The attached benchmark script does LISTEN on one connection,
and then uses pgbench to send NOTIFY on a varying number of
connections and jobs, to cause a high procsignal load.
I've run the benchmark on my MacBook Pro M3 Max,
10 seconds per run, 3 runs.
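The attached script is not reproduced inline, but the general shape of such
a benchmark might look like the following sketch (database name, channel
name, and file names are assumptions, not the actual attachment; it also
needs a running server, so it is shown for illustration only):

```shell
#!/bin/sh
# Hypothetical sketch of the benchmark described above: one backend
# LISTENs while pgbench sends NOTIFY from a varying number of
# connections/jobs, generating heavy procsignal traffic to the listener.
DB=postgres

cat > notify.sql <<'EOF'
NOTIFY bench_chan;
EOF

for c in 1 2 4 8 16 32 64 128; do
    # Keep one backend listening for the duration of this run; it receives
    # a SIGUSR1 per batch of notifications even while pg_sleep() runs.
    psql -d "$DB" -c "LISTEN bench_chan;" -c "SELECT pg_sleep(12);" >/dev/null &
    listener=$!

    # 10 seconds per run, as in the report; repeat 3 times and average.
    pgbench -n -f notify.sql -c "$c" -j "$c" -T 10 "$DB" | grep tps

    wait "$listener"
done
```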
 Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
                1 |       118833 |      118902 |             0.06% |             484 |            520
                2 |       156005 |      194873 |            24.91% |            3145 |            631
                4 |       177351 |      190672 |             7.51% |            4305 |           1439
                8 |       116597 |      124793 |             7.03% |            1549 |           1011
               16 |        40835 |      113312 |           177.49% |            2695 |           1155
               32 |        37940 |      108469 |           185.90% |            2533 |            487
               64 |        35495 |      104994 |           195.80% |            1837 |            318
              128 |        40193 |      100246 |           149.41% |            2254 |            393
(8 rows)
Raw Data Summary:
 Version  | Connections | Runs | Min TPS | Max TPS | Avg TPS
----------+-------------+------+---------+---------+---------
 master   |           1 |    3 |  118274 |  119119 |  118833
 master   |           2 |    3 |  152803 |  159090 |  156005
 master   |           4 |    3 |  174381 |  182288 |  177351
 master   |           8 |    3 |  115021 |  118117 |  116597
 master   |          16 |    3 |   39048 |   43935 |   40835
 master   |          32 |    3 |   35754 |   40716 |   37940
 master   |          64 |    3 |   33417 |   36906 |   35495
 master   |         128 |    3 |   37925 |   42433 |   40193
 patch-v1 |           1 |    3 |  118589 |  119503 |  118902
 patch-v1 |           2 |    3 |  194204 |  195457 |  194873
 patch-v1 |           4 |    3 |  189771 |  192332 |  190672
 patch-v1 |           8 |    3 |  123929 |  125904 |  124793
 patch-v1 |          16 |    3 |  112328 |  114584 |  113312
 patch-v1 |          32 |    3 |  107975 |  108949 |  108469
 patch-v1 |          64 |    3 |  104649 |  105275 |  104994
 patch-v1 |         128 |    3 |   99792 |  100479 |  100246
(16 rows)
/Joel
On Wed, Jul 23, 2025 at 8:08 AM Joel Jacobson <joel@compiler.org> wrote:
> Previously, ProcSignal used an array of volatile sig_atomic_t flags, one
> per signal reason. A sender would set a flag and then unconditionally
> send a SIGUSR1 to the target process. This could result in a storm of
> redundant signals if multiple processes signaled the same target before
> it had a chance to run its signal handler.
It's a good idea with great results!
FWIW the "new interrupts systems" patch set[1] also contains this
idea. But it goes further, because it also removes all the
corresponding "pending" local variables: notice how
HandleNotifyInterrupt() really just sets notifyInterruptPending = true
and does SetLatch(MyLatch). So instead of having
CHECK_FOR_INTERRUPTS() check all those local variables, we can merge
"procsignals" with "interrupts", and have CHECK_FOR_INTERRUPTS()
itself read that atomic word directly, skipping the middleman and
deleting the whole ProcSignal system. Originally I proposed that the
sending backend would now do SendInterrupt() directly, essentially
moving the contents of the signal handler into the sender and setting
the receiver's latch directly. But then Heikki went further and
proposed deleting latches too, and having "interrupts" as our only
remaining inter-backend wakeup abstraction, with one of the interrupt
flags used as a "general" one that replaces all the places that use
SetLatch() directly today.
The end state could include a WaitEventSetWait() implementation that
waits for that single remaining atomic word with a (multiplexed) futex
wait, also removing the "maybe sleeping" dance because futexes already
have ideal protection against races. Or you can send a byte to a pipe
or post a custom event to an io_uring or kqueue or whatever.
That's part of a larger project to get rid of all inter-backend
signals, on the pathway to multi-threaded backends. Some things along
the way have included making it true that all real
IPC-request-handling work is done in CHECK_FOR_INTERRUPTS(), not
signal handlers (ie making it all "cooperative"), and making it safe
to use uint32_t atomics at all in signal handlers as you're doing here
(previously they were allowed to be emulated with spinlocks), as an
intermediate step because in the short term we still have signals for
timers.
I'm planning to post a rebase of that work with some improvements
soon. Hmm, I wonder if there is any sense in restructuring it so that
your patch becomes an incremental stepping stone. I think one of our
problems might have been trying to change too many things at once.
What you're doing is certainly independently good, but I haven't
studied your patch too closely yet.
[1]: /messages/by-id/CA+hUKG+3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ@mail.gmail.com
On Tue, Jul 22, 2025, at 23:54, Thomas Munro wrote:
> [...]
>
> I'm planning to post a rebase of that work with some improvements
> soon. Hmm, I wonder if there is any sense in restructuring it so that
> your patch becomes an incremental stepping stone. I think one of our
> problems might have been trying to change too many things at once.
> What you're doing is certainly independently good, but I haven't
> studied your patch too closely yet.
Great to hear you're also working in this area!
I like the idea of incremental stepping stones, and I can certainly
help with both reviewing and participating in that work.
To give you an idea of the size of my patch:
src/backend/storage/ipc/procsignal.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------
src/include/storage/procsignal.h | 3 +++
2 files changed, 57 insertions(+), 49 deletions(-)
When you rebase your patchset, I can try to see if there is a natural fit.
/Joel