Interrupts vs signals

Started by Thomas Munro · over 4 years ago · 56 messages · hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

I wonder if we really need signals to implement interrupts. Given
that they are non-preemptive/cooperative (work happens at the next
CFI()), why not just use shared memory flags and latches? That skips
a bunch of code, global variables and scary warnings about programming
in signal handlers.

I sketched out some code to try that a few months back, while
speculating about bite-sized subproblems that would come up if each
backend is, one day, a thread.

There are several other conditions that are also handled by
CHECK_FOR_INTERRUPTS(), but are not triggered by other backends
sending signals, or are set by other signal handlers (SIGALRM,
SIGQUIT). One idea is to convert those into "procsignals" too, for
consistency. In the attached, they can be set (ie by the same
backend) with ProcSignalRaise(), but it's possible that in future we
might have a reason for another backend to set them too, so it seems
like a good idea to have a single system, effectively merging the
concepts of "procsignals" and "interrupts".

There are still a few more ad hoc (non-ProcSignal) uses of SIGUSR1 in
the tree. For one thing, we don't allow the postmaster to set
latches; if we gave up on that rule, we wouldn't need the bgworker
please-signal-me thing. Also the logical replication launcher does
the same sort of thing for no apparent reason. Changed in the
attached -- mainly so I could demonstrate that check-world passes with
SIGUSR1 ignored.

The attached is only experiment grade code: in particular, I didn't
quite untangle the recovery conflict flags properly. It's also doing
function calls where some kind of fast inlined magic is probably
required, and I probably have a few other details wrong, but I figured
it was good enough to demonstrate the concept.

Attachments:

0001-WIP-Refactor-procsignals-and-interrupts.patch (text/x-patch, +197 −345)
#2 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#1)
Re: Interrupts vs signals

Hi,

On 2021-10-21 07:55:54 +1300, Thomas Munro wrote:

I wonder if we really need signals to implement interrupts. Given
that they are non-preemptive/cooperative (work happens at the next
CFI()), why not just use shared memory flags and latches? That skips
a bunch of code, global variables and scary warnings about programming
in signal handlers.

Depending on how you implement them, one difference could be whether / when
"slow" system calls (recv, poll, etc) are interrupted.

Another is that signal handling provides a memory barrier in the
receiving process. For things that rarely change (like most interrupts), it
can be more efficient to move that cost out-of-line, instead of
incurring it on every check.

One nice thing about putting the state variables into shared memory is that
it would let us see the pending interrupts of other backends for debugging
purposes.

One idea is to convert those into "procsignals" too, for
consistency. In the attached, they can be set (ie by the same
backend) with ProcSignalRaise(), but it's possible that in future we
might have a reason for another backend to set them too, so it seems
like a good idea to have a single system, effectively merging the
concepts of "procsignals" and "interrupts".

This seems a bit confusing to me. For one, we need to have interrupts working
before we have a proc, IIRC. But leaving details like that aside, it just
seems a bit backwards to me. I'm on board with other backends directly setting
interrupt flags, but it seems to me that the procsignal stuff should be
"client" of the process-local interrupt infrastructure, rather than the other
way round.

+bool
+ProcSignalAnyPending(void)
+{
+	volatile ProcSignalSlot *slot = MyProcSignalSlot;
-	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	/* XXX make this static inline? */
+	/* XXX point to a dummy entry instead of using NULL to avoid a branch */
+	return slot && slot->pss_signaled;
+}

ISTM it might be easier to make this stuff efficiently race-free if we made
this a count of pending operations.

@@ -3131,12 +3124,13 @@ ProcessInterrupts(void)
/* OK to accept any interrupts now? */
if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
return;
-	InterruptPending = false;
+	ProcSignalClearAnyPending();
+
+	pg_read_barrier();
-	if (ProcDiePending)
+	if (ProcSignalConsume(PROCSIG_DIE))
{

I think making all of these checks into function calls isn't great. How about
making the set of pending signals a bitmask? That'd let us efficiently check
a bunch of interrupts together, and even where not, it'd just be a single test
of the mask, likely already in a register.
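The bitmask idea can be sketched in a few lines. This is a hypothetical stand-in, not the patch's API: the `interrupt_pending_any`/`interrupt_consume` names and the `INTERRUPT_*` bits are illustrative, and the shared atomic word is reduced to a plain `uint32_t` for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interrupt bits; names are illustrative, not the patch's. */
typedef enum
{
	INTERRUPT_DIE = 1u << 0,
	INTERRUPT_QUERY_CANCEL = 1u << 1,
	INTERRUPT_BARRIER = 1u << 2,
} InterruptBit;

static uint32_t pending_interrupts;	/* stand-in for the shared atomic word */

/* Test a whole group of interrupts with a single mask comparison. */
static inline bool
interrupt_pending_any(uint32_t mask)
{
	return (pending_interrupts & mask) != 0;
}

/* Consume one interrupt: report whether it was pending, and clear it. */
static inline bool
interrupt_consume(uint32_t bit)
{
	if ((pending_interrupts & bit) == 0)
		return false;
	pending_interrupts &= ~bit;
	return true;
}
```

The point of the layout is that a CHECK_FOR_INTERRUPTS()-style site only needs one load and one AND against a constant mask, rather than a chain of per-flag function calls.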

Greetings,

Andres Freund

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#2)
Re: Interrupts vs signals

On Thu, Oct 21, 2021 at 8:27 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-10-21 07:55:54 +1300, Thomas Munro wrote:

I wonder if we really need signals to implement interrupts. Given
that they are non-preemptive/cooperative (work happens at the next
CFI()), why not just use shared memory flags and latches? That skips
a bunch of code, global variables and scary warnings about programming
in signal handlers.

Depending on how you implement them, one difference could be whether / when
"slow" system calls (recv, poll, etc) are interrupted.

Hopefully by now all such waits are implemented with latch.c facilities?

Another is that signal handling provides a memory barrier in the
receiving process. For things that rarely change (like most interrupts), it
can be more efficient to move that cost out-of-line, instead of
incurring it on every check.

Agreed, but in this experiment I was trying out the idea that a memory
barrier is not really needed at all, unless you're about to go to
sleep. We already insert one of those before a latch wait. That is,
if we see !set->latch->is_set, we do pg_memory_barrier() and check
again, before sleeping, so your next CFI must see the flag. For
computation loops (sort, hash, query execution, ...), I speculate that
a relaxed read of memory is fine... you'll see the flag pretty soon,
and you certainly won't be allowed to finish your computation and go
to sleep.
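That double-check protocol can be sketched with C11 atomics. This is a single-threaded stand-in under stated assumptions: the `latch_*` names are hypothetical, and `atomic_thread_fence(memory_order_seq_cst)` plays the role of pg_memory_barrier().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool latch_is_set;

/* Cheap check used from computation loops: a relaxed load is enough,
 * because a stale "false" only delays handling; it cannot lose a wakeup,
 * since we are not about to sleep. */
static bool
latch_check_relaxed(void)
{
	return atomic_load_explicit(&latch_is_set, memory_order_relaxed);
}

/* Expensive check used only just before sleeping: the full barrier pairs
 * with the setter's barrier, so a set that happened before we committed
 * to sleeping is guaranteed to be visible on the re-check. */
static bool
latch_check_before_sleep(void)
{
	if (latch_check_relaxed())
		return true;
	atomic_thread_fence(memory_order_seq_cst);	/* pg_memory_barrier() */
	return atomic_load_explicit(&latch_is_set, memory_order_relaxed);
}

static void
latch_set(void)
{
	atomic_store_explicit(&latch_is_set, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* ...then wake any sleeper (kill()/futex/event in the real thing) */
}
```

The design choice being argued for is exactly this asymmetry: pay for the barrier once, at the point where missing the flag would actually cost a sleep, not on every CFI().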

One nice thing about putting the state variables into shared memory is that
it would let us see the pending interrupts of other backends for debugging
purposes.

+1

One idea is to convert those into "procsignals" too, for
consistency. In the attached, they can be set (ie by the same
backend) with ProcSignalRaise(), but it's possible that in future we
might have a reason for another backend to set them too, so it seems
like a good idea to have a single system, effectively merging the
concepts of "procsignals" and "interrupts".

This seems a bit confusing to me. For one, we need to have interrupts working
before we have a proc, IIRC. But leaving details like that aside, it just
seems a bit backwards to me. I'm on board with other backends directly setting
interrupt flags, but it seems to me that the procsignal stuff should be
"client" of the process-local interrupt infrastructure, rather than the other
way round.

Hmm. Yeah, I see your point. But I can also think of some arguments
for merging the concepts of local and shared interrupts; see below.

In this new sketch, I tried doing it the other way around. That is,
completely removing the concept of "ProcSignal", leaving only
"Interrupts". Initially, MyPendingInterrupts points to something
private, and once you're connected to shared memory it points to
MyProc->pending_interrupts.

+bool
+ProcSignalAnyPending(void)
+{
+     volatile ProcSignalSlot *slot = MyProcSignalSlot;
-     if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
-             RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+     /* XXX make this static inline? */
+     /* XXX point to a dummy entry instead of using NULL to avoid a branch */
+     return slot && slot->pss_signaled;
+}

ISTM it might be easier to make this stuff efficiently race-free if we made
this a count of pending operations.

Hmm, with a unified interrupt system and a bitmap it's not necessary
to have a separate flag/counter at all.

@@ -3131,12 +3124,13 @@ ProcessInterrupts(void)
/* OK to accept any interrupts now? */
if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
return;
-     InterruptPending = false;
+     ProcSignalClearAnyPending();
+
+     pg_read_barrier();
-     if (ProcDiePending)
+     if (ProcSignalConsume(PROCSIG_DIE))
{

I think making all of these checks into function calls isn't great. How about
making the set of pending signals a bitmask? That'd let us efficiently check
a bunch of interrupts together, and even where not, it'd just be a single test
of the mask, likely already in a register.

+1.

Some assorted notes:

1. Aside from doing interrupts in this new way, I also have the
postmaster setting latches (!) instead of sending ad hoc SIGUSR1 here
and there. My main reason for doing that was to be able to chase out
all reasons to register a SIGUSR1 handler, so I could prove that
check-world passes. I like it, though. But maybe it's really a
separate topic.

2. I moved this stuff into interrupt.{h,c}. There is nothing left in
procsignal.c except code relating to ProcSignalBarrier. I guess that
thing could use another name, anyway. It's a ...
SystemInterruptBarrier?

3. Child-level SIGINT and SIGTERM handlers probably aren't really
necessary, either: maybe the sender could do
InterruptSend(INTERRUPT_{DIE,QUERY_CANCEL}, pgprocno) instead? But
perhaps people are attached to being able to send those signals from
external programs directly to backends.

4. Like the above, a SIGALRM handler might need to do eg
InterruptRaise(INTERRUPT_STATEMENT_TIMEOUT). That's a problem for
systems using spinlocks (self-deadlock against user context in
InterruptRaise()), so I'd need to come up with some flag protocol for
dinosaurs to make that safe, OR revert to having these "local only"
interrupts done with separate flags, as you were getting at earlier.

5. The reason I prefer to put currently "local only" interrupts into
the same atomic system is that I speculate that ultimately all of the
backend-level signal handlers won't be needed. They all fall into
three categories: (1) could be replaced with these interrupts
directly, (2) could be replaced by the new timer infrastructure that
multithreaded postgres would need to have to deliver interrupts to the
right recipients, (3) are quickdie and can be handled at the
containing process level. Then the only signal handlers left are top
level external ones.

But perhaps you're right and I should try reintroducing separate local
interrupts for now. I dunno, I like the simplicity of the unified
system; if only it weren't for those spinlock-backed atomics.

Attachments:

0001-Refactor-interrupt-system.patch (text/x-patch, +477 −694)
#4 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#3)
Re: Interrupts vs signals

On Thu, Nov 11, 2021 at 12:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Depending on how you implement them, one difference could be whether / when
"slow" system calls (recv, poll, etc) are interrupted.

Hopefully by now all such waits are implemented with latch.c facilities?

Do read(), write(), etc. count? Because we certainly have raw calls to
those functions in lots of places.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: Interrupts vs signals

Hi,

On 2021-11-11 09:06:01 -0500, Robert Haas wrote:

On Thu, Nov 11, 2021 at 12:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Depending on how you implement them, one difference could be whether / when
"slow" system calls (recv, poll, etc) are interrupted.

Hopefully by now all such waits are implemented with latch.c facilities?

Do read(), write(), etc. count? Because we certainly have raw calls to
those functions in lots of places.

They can count, but only when used for network sockets or pipes ("slow
devices" or whatever the posix language is). Disk IO doesn't count as that. So
I don't think it'd be a huge issue.

Greetings,

Andres Freund

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: Interrupts vs signals

On Thu, Nov 11, 2021 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:

They can count, but only when used for network sockets or pipes ("slow
devices" or whatever the posix language is). Disk IO doesn't count as that. So
I don't think it'd be a huge issue.

Somehow the idea that the network is a slow device and the disk a fast
one does not seem like it's necessarily accurate on modern hardware,
but I guess the spec is what it is.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#6)
Re: Interrupts vs signals

On Fri, Nov 12, 2021 at 9:24 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 11, 2021 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:

They can count, but only when used for network sockets or pipes ("slow
devices" or whatever the posix language is). Disk IO doesn't count as that. So
I don't think it'd be a huge issue.

Somehow the idea that the network is a slow device and the disk a fast
one does not seem like it's necessarily accurate on modern hardware,
but I guess the spec is what it is.

[Somehow I managed to reply to Robert only; let me try that again,
this time to the list...]

Network filesystems have in the past been confusing because they're
both disk-like and network-like, and also slow as !@#$, which is why
there have been mount point options like "intr", "nointr" (now ignored
on Linux) to control what happens if you receive an async signal
during a sleepy read/write. But even if you had some kind of
Deathstation 9000 that had a switch on the front panel that ignores
SA_RESTART and produces EINTR for disk I/O when a signal arrives,
PostgreSQL already doesn't work today. Our pread() and pwrite() paths
for data and WAL don't have EINTR retry loops or
CHECK_FOR_INTERRUPTS() calls (we just can't take interrupts in the middle of
eg a synchronous write), so I think we'd produce an ERROR or PANIC.
So I think disk I/O is irrelevant, and network/pipe I/O is already
handled everywhere via latch.c facilities.

If there are any eg blocking reads on a socket in PostgreSQL, we
should fix that to use latch.c non-blocking techniques, because such a
place is already a place that ignores postmaster death and interrupts.
To be more precise: such a place could of course wake up for EINTR on
SIGUSR1 from procsignal.c, and that would no longer happen with my
patch, but if we're relying on that anywhere, it's dangerous and
unreliable. If SIGUSR1 is delivered right before you enter a blocking
read(), you'll sleep waiting for the socket or whatever. That's
precisely the problem that latch.c solves, and why it's already a bug
if there are such places.
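The latch.c technique being described is essentially the classic self-pipe trick: instead of "check a flag, then block in read()", poll the data fd and a notification fd together, so a wakeup sent just before blocking still terminates the wait. A hypothetical standalone sketch, not latch.c's actual implementation (and with error handling omitted for brevity):

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

static int	wakeup_pipe[2];		/* [0] = read end, [1] = write end */

static void
wakeup_init(void)
{
	(void) pipe(wakeup_pipe);
}

/* Can be called from a signal handler: write() is async-signal-safe. */
static void
wakeup_send(void)
{
	char		c = 0;

	(void) write(wakeup_pipe[1], &c, 1);
}

/* Wait for data on sock_fd OR a wakeup, whichever comes first.
 * Returns true if a wakeup arrived. A wakeup sent before we block is
 * still seen, because the byte sits in the pipe until we consume it. */
static bool
wait_for_data_or_wakeup(int sock_fd, int timeout_ms)
{
	struct pollfd fds[2] = {
		{.fd = sock_fd, .events = POLLIN},
		{.fd = wakeup_pipe[0], .events = POLLIN},
	};

	(void) poll(fds, 2, timeout_ms);
	return (fds[1].revents & POLLIN) != 0;
}
```

Contrast with the broken pattern: if the flag check and the blocking read() are separate steps, a signal delivered between them is lost until the socket happens to become readable.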

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
Re: Interrupts vs signals

Here's an updated version of this patch.

The main idea is that SendProcSignal(pid, PROCSIGNAL_XXX, procno)
becomes SendInterrupt(INTERRUPT_XXX, procno), and all the pending
interrupt global variables and pss_procsignalFlags[] go away, along
with the SIGUSR1 handler. The interrupts are compressed into a single
bitmap. See commit message for more details.

The patch is failing on Windows CI for reasons I haven't debugged yet,
but I wanted to share what I have so far. Work in progress!

Here is my attempt to survey the use of signals and write down what I
think we might do about them all so far, to give the context for this
patch:

https://wiki.postgresql.org/wiki/Signals

Comments, corrections, edits very welcome.

Attachments:

v2-0001-Redesign-interrupts-remove-ProcSignals.patch (text/x-patch, +576 −844)
#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#8)
Re: Interrupts vs signals

On 08/07/2024 05:56, Thomas Munro wrote:

Here's an updated version of this patch.

The main idea is that SendProcSignal(pid, PROCSIGNAL_XXX, procno)
becomes SendInterrupt(INTERRUPT_XXX, procno), and all the pending
interrupt global variables and pss_procsignalFlags[] go away, along
with the SIGUSR1 handler. The interrupts are compressed into a single
bitmap. See commit message for more details.

The patch is failing on Windows CI for reasons I haven't debugged yet,
but I wanted to share what I have so far. Work in progress!

Here is my attempt to survey the use of signals and write down what I
think we might do about them all so far, to give the context for this
patch:

https://wiki.postgresql.org/wiki/Signals

Comments, corrections, edits very welcome.

Nice, thanks!

Background worker state notifications are also changed from raw
kill(SIGUSR1) to SetLatch(). That means that SetLatch() is now called
from the postmaster. The main purpose of including that change is to be
able to remove procsignal_sigusr1_handler completely and set SIGUSR1 to
SIG_IGN, and show the system working.

XXX Do we need to invent SetLatchRobust() that doesn't trust anything in
shared memory, to be able to set latches from the postmaster?

The patch actually does both: it still does kill(SIGUSR1) and also sets
the latch.

I think it would be nice if RegisterDynamicBackgroundWorker() had a
"bool notify_me" argument, instead of requiring the caller to set
"bgw_notify_pid = MyProcPid" before the call. That's a
backwards-compatibility break, but maybe we should bite the bullet and
do it. Or we could do this in RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything other than 0 or
MyProcPid.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it
can do any damage if you call it on a pointer to garbage, except that if
the pointer itself is bogus, just dereferencing it can cause a
segfault. So it would be nice to have a version specifically designed
with that in mind. For example, it could assume that the Latch's pid is
never legally equal to MyProcPid, because postmaster cannot own any latches.

Another approach would be to move the responsibility of background
worker state notifications out of postmaster completely. When a new
background worker is launched, the worker process itself could send the
notification that it has started. And similarly, when a worker exits, it
could send the notification just before exiting. There's a little race
condition with exiting: if a process is waiting for the bgworker to
exit, and launches a new worker immediately when the old one exits,
there will be a brief period when the old and new process are alive at
the same time. The old worker wouldn't be doing anything interesting
anymore since it's exiting, but it still counts towards
max_worker_processes, so launching the new process might fail because of
hitting the limit. Maybe we should just bump up max_worker_processes. Or
postmaster could check PMChildFlags and not count processes that have
already deregistered from PMChildFlags towards the limit.

-volatile uint32 InterruptHoldoffCount = 0;
-volatile uint32 QueryCancelHoldoffCount = 0;
-volatile uint32 CritSectionCount = 0;
+uint32 InterruptHoldoffCount = 0;
+uint32 QueryCancelHoldoffCount = 0;
+uint32 CritSectionCount = 0;

I wondered if these are used in PG_TRY-CATCH blocks in a way that would
still require volatile. I couldn't find any such usage by some quick
grepping, so I think we're good, but I thought I'd mention it.

+/*
+ * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
+ * ProcessInterrupts() handle.  These perform work that is safe to run whenever
+ * interrupts are not "held".  Other kinds of interrupts are only handled at
+ * more restricted times.
+ */
+#define INTERRUPT_STANDARD_MASK							   \

Some interrupts are missing from this mask:

- INTERRUPT_PARALLEL_APPLY_MESSAGE
- INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
- INTERRUPT_SINVAL_CATCHUP
- INTERRUPT_NOTIFY

Is that on purpose?

-/*
- * Because backends sitting idle will not be reading sinval events, we
- * need a way to give an idle backend a swift kick in the rear and make
- * it catch up before the sinval queue overflows and forces it to go
- * through a cache reset exercise. This is done by sending
- * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
- *
- * The signal handler will set an interrupt pending flag and will set the
- * processes latch. Whenever starting to read from the client, or when
- * interrupted while doing so, ProcessClientReadInterrupt() will call
- * ProcessCatchupEvent().
- */
-volatile sig_atomic_t catchupInterruptPending = false;

Would be nice to move that comment somewhere else rather than remove it
completely.

--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -444,6 +444,10 @@ InitProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();
+	/* We're now ready to accept interrupts from other processes. */
+	pg_atomic_init_u32(&MyProc->pending_interrupts, 0);
+	SwitchToSharedInterrupts();
+
/* now that we have a proc, report wait events to shared memory */
pgstat_set_wait_event_storage(&MyProc->wait_event_info);

@@ -611,6 +615,9 @@ InitAuxiliaryProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();

+	/* We're now ready to accept interrupts from other processes. */
+	SwitchToSharedInterrupts();
+
/* now that we have a proc, report wait events to shared memory */
pgstat_set_wait_event_storage(&MyProc->wait_event_info);

Is there a reason for the different initialization between a regular
backend and aux process?

+/*
+ * Switch to shared memory interrupts.  Other backends can send interrupts
+ * to this one if they know its ProcNumber.
+ */
+void
+SwitchToSharedInterrupts(void)
+{
+	pg_atomic_fetch_or_u32(&MyProc->pending_interrupts, pg_atomic_read_u32(MyPendingInterrupts));
+	MyPendingInterrupts = &MyProc->pending_interrupts;
+}

Hmm, I think there's a race condition here (and similarly in
SwitchToLocalInterrupts), if the signal handler runs between the
pg_atomic_fetch_or_u32, and changing MyPendingInterrupts. Maybe
something like this instead:

MyPendingInterrupts = &MyProc->pending_interrupts;
pg_memory_barrier();
pg_atomic_fetch_or_u32(&MyProc->pending_interrupts,
pg_atomic_read_u32(LocalPendingInterrupts));

And perhaps this should also clear LocalPendingInterrupts, just to be tidy.
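The suggested ordering can be sketched with C11 atomics as a stand-in for pg_atomic_* (names are hypothetical; in the real code the `my_pending` pointer must additionally be safe to read from a signal handler): repoint first, barrier, then merge, so a concurrent raise targets either the old word (merged below) or the new word (already shared), and nothing is lost.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint local_pending;
static atomic_uint shared_pending;	/* stand-in for MyProc->pending_interrupts */
static atomic_uint *my_pending = &local_pending;

static void
switch_to_shared_interrupts(void)
{
	/* Publish the new target first... */
	my_pending = &shared_pending;
	atomic_thread_fence(memory_order_seq_cst);

	/* ...then carry over anything raised against the old target.
	 * atomic_exchange also clears the local word, per the "tidy" point. */
	unsigned	carried = atomic_exchange(&local_pending, 0);

	atomic_fetch_or(&shared_pending, carried);
}
```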

@@ -138,7 +139,8 @@
typedef struct ProcState
{
/* procPid is zero in an inactive ProcState array entry. */
-	pid_t		procPid;		/* PID of backend, for signaling */
+	pid_t		procPid;		/* pid of backend */
+	ProcNumber	pgprocno;		/* for sending interrupts */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
int			nextMsgNum;		/* next message number to read */
bool		resetState;		/* backend needs to reset its state */

We can easily remove procPid altogether now that we have pgprocno here.
Similarly with the pid/pgprocno in ReplicationSlot and WalSndState.

--
Heikki Linnakangas
Neon (https://neon.tech)

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Interrupts vs signals

On Mon, Jul 8, 2024 at 5:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Another approach would be to move the responsibility of background
worker state notifications out of postmaster completely. When a new
background worker is launched, the worker process itself could send the
notification that it has started. And similarly, when a worker exits, it
could send the notification just before exiting. There's a little race
condition with exiting: if a process is waiting for the bgworker to
exit, and launches a new worker immediately when the old one exits,
there will be a brief period when the old and new process are alive at
the same time. The old worker wouldn't be doing anything interesting
anymore since it's exiting, but it still counts towards
max_worker_processes, so launching the new process might fail because of
hitting the limit. Maybe we should just bump up max_worker_processes. Or
postmaster could check PMChildFlags and not count processes that have
already deregistered from PMChildFlags towards the limit.

I can testify that the current system is the result of a lot of trial
and error. I'm not saying it can't be made better, but my initial
attempts at getting this to work (back in the 9.4 era) resembled what
you proposed here, were consequently a lot simpler than what we have
now, and also did not work. Race conditions like you mention here were
part of that. Another consideration is that fork() can fail, and in
that case, the process that tried to register the new background
worker needs to find out that the background worker won't ever be
starting. Yet another problem is that, even if fork() succeeds, the
new process might fail before it executes any of our code e.g. because
it seg faults very early, a case that actually happened to me -
inadvertently - while I was testing these facilities. I ended up
deciding that we can't rely on the new process to do anything until
it's given us some signal that it is alive and able to carry out its
duties. If it dies before telling us that, or never starts in the
first place, we have to have some other way of finding that out, and
it's difficult to see how that can happen without postmaster
involvement.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Interrupts vs signals

On Mon, Jul 8, 2024 at 9:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

The patch actually does both: it still does kill(SIGUSR1) and also sets
the latch.

Yeah, I had some ideas about supporting old extension code that really
wanted a SIGUSR1, but on reflection, the only reason anyone ever wants
that is so that sigusr1_handler can SetLatch(), which pairs with
WaitLatch() in WaitForBackgroundWorker*(). Let's go all the way and
assume that.

I think it would be nice if RegisterDynamicBackgroundWorker() had a
"bool notify_me" argument, instead of requiring the caller to set
"bgw_notify_pid = MyProcPid" before the call. That's a
backwards-compatibility break, but maybe we should bite the bullet and
do it. Or we could do this in RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything other than 0 or
MyProcPid.

Another idea: we could keep the bgw_notify_pid field around for a
while, documented as unused and due to be removed in future. We could
automatically capture the caller's proc number. So you get latch
wakeups by default, which I expect many people want, and most people
could cope with even if they don't want them. If you really really
don't want them, you could set a new flag BGW_NO_NOTIFY.

I have now done this part of the change in a new first patch. This
particular use of kill(SIGUSR1) is separate from the ProcSignal
removal, it's just that it relies on ProcSignal's handler's default
action of calling SetLatch(). It's needed so the ProcSignal-ectomy
can fully delete sigusr1_handler(), but it's not directly the same
thing, so it seemed good to split the patch.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it
can do any damage if you called it on a pointer to garbage, except if
the pointer itself is bogus, then just dereferencing it an cause a
segfault. So it would be nice to have a version specifically designed
with that in mind. For example, it could assume that the Latch's pid is
never legally equal to MyProcPid, because postmaster cannot own any latches.

Yeah, I'm starting to think that all we need to do here is range-check
the proc number. Here's a version that adds
ProcSetLatch(proc_number). Another idea would be for SetLatch(latch)
to sanitise the address of a latch, ie its offset and range.
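A minimal standalone sketch of the range-check idea (`FakeLatch` and `proc_set_latch_checked` are hypothetical stand-ins for the real Latch and ProcSetLatch):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCNO 128			/* stand-in for the real slot count */

typedef struct
{
	bool		is_set;
} FakeLatch;

static FakeLatch proc_latches[MAX_PROCNO];	/* stand-in for the PGPROC array */

/* Refuse out-of-range proc numbers before touching anything: the
 * postmaster must never chase an arbitrary pointer found in (possibly
 * corrupted) shared memory, so the only trusted input is the index. */
static bool
proc_set_latch_checked(int procno)
{
	if (procno < 0 || procno >= MAX_PROCNO)
		return false;
	proc_latches[procno].is_set = true;
	return true;
}
```

The idea is that the address ultimately dereferenced is always computed from a validated index into an array whose base and size the postmaster knows independently of shared memory contents.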

What the user really wants to be able to do with this bgworker API, I
think, is wait for a given a handle, which could find a condition
variable + generation in the slot, so that we don't have to register
any proc numbers anywhere until we're actually waiting. But *clearly*
the postmaster can't use the condition variable API without risking
following corrupted pointers in shared memory. Whereas AFAICT
ProcSetLatch() from the patched postmaster can't really be corrupted
in any new way that it couldn't already be corrupted in master (where
it runs in the target process), if we're just a bit paranoid about how
we find our way to the latch.

Receiving latch wakeups in the postmaster might be another question,
but I don't think we need to confront that question just yet.

-volatile uint32 InterruptHoldoffCount = 0;
-volatile uint32 QueryCancelHoldoffCount = 0;
-volatile uint32 CritSectionCount = 0;
+uint32 InterruptHoldoffCount = 0;
+uint32 QueryCancelHoldoffCount = 0;
+uint32 CritSectionCount = 0;

I wondered if these are used in PG_TRY-CATCH blocks in a way that would
still require volatile. I couldn't find any such usage by some quick
grepping, so I think we're good, but I thought I'd mention it.

Hmm. Still thinking about this.

+/*
+ * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
+ * ProcessInterrupts() handle.  These perform work that is safe to run whenever
+ * interrupts are not "held".  Other kinds of interrupts are only handled at
+ * more restricted times.
+ */
+#define INTERRUPT_STANDARD_MASK                                                         \

Some interrupts are missing from this mask:

- INTERRUPT_PARALLEL_APPLY_MESSAGE

Oops, that one ^ is a rebasing mistake. I wrote the ancestor of this
patch in 2021, and that new procsignal arrived in 2023, and I put the
code in to handle it, but I forgot to add it to the mask, which gives
me an idea (see below)...

- INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
- INTERRUPT_SINVAL_CATCHUP
- INTERRUPT_NOTIFY

Is that on purpose?

INTERRUPT_SINVAL_CATCHUP and INTERRUPT_NOTIFY are indeed handled
differently on purpose. In master, they don't set InterruptPending,
and they are not handled by regular CHECK_FOR_INTERRUPTS() sites, but
in the patch they still need a bit in pending_interrupts, and that is
what that mask hides from CHECK_FOR_INTERRUPTS(). They are checked
explicitly in ProcessClientReadInterrupt(). I think the idea is that
we can't handle sinval at random places because that might create
dangling pointers to cached objects where we don't expect them, and we
can't emit NOTIFY-related protocol messages at random times either.

There is something a little funky about _IDLE_STATS_UPDATE_TIMEOUT,
though. It has a different scheme for running only when idle, where
if it opts not to do anything, it doesn't consume the interrupt (a
later CFI() will, but the latch is not set so we might sleep). I was
confused by that. I think I have changed it to be more faithful to
master's behaviour now.

Hmm, a better terminology for the interrupts that CFI handles would be
s/standard/regular/, so I have changed that.

New idea: it would be less error-prone if we instead had a mask of
these special cases, of which there are now only two. Tried that way!

-/*
- * Because backends sitting idle will not be reading sinval events, we
- * need a way to give an idle backend a swift kick in the rear and make
- * it catch up before the sinval queue overflows and forces it to go
- * through a cache reset exercise. This is done by sending
- * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
- *
- * The signal handler will set an interrupt pending flag and will set the
- * processes latch. Whenever starting to read from the client, or when
- * interrupted while doing so, ProcessClientReadInterrupt() will call
- * ProcessCatchupEvent().
- */
-volatile sig_atomic_t catchupInterruptPending = false;

Would be nice to move that comment somewhere else rather than remove it
completely.

OK, I moved it to the top of ProcessCatchupInterrupt().

--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -444,6 +444,10 @@ InitProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();
+     /* We're now ready to accept interrupts from other processes. */
+     pg_atomic_init_u32(&MyProc->pending_interrupts, 0);
+     SwitchToSharedInterrupts();
+
/* now that we have a proc, report wait events to shared memory */
pgstat_set_wait_event_storage(&MyProc->wait_event_info);

@@ -611,6 +615,9 @@ InitAuxiliaryProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();

+     /* We're now ready to accept interrupts from other processes. */
+     SwitchToSharedInterrupts();
+
/* now that we have a proc, report wait events to shared memory */
pgstat_set_wait_event_storage(&MyProc->wait_event_info);

Is there a reason for the different initialization between a regular
backend and aux process?

No. But I thought about something else to fix here. Really we don't
want to switch to shared interrupts until we are ready for CFI() to do
stuff. I think that should probably be at the places where master
unblocks signals. Otherwise, for example, if someone sends you an
interrupt while you're starting up, something as innocent as
elog(DEBUG, ...), which reaches CFI(), might try to do things for
which the infrastructure is not yet fully set up, for example
INTERRUPT_BARRIER.

Not done yet, but wanted to share this new version.

+/*
+ * Switch to shared memory interrupts.  Other backends can send interrupts
+ * to this one if they know its ProcNumber.
+ */
+void
+SwitchToSharedInterrupts(void)
+{
+     pg_atomic_fetch_or_u32(&MyProc->pending_interrupts, pg_atomic_read_u32(MyPendingInterrupts));
+     MyPendingInterrupts = &MyProc->pending_interrupts;
+}

Hmm, I think there's a race condition here (and similarly in
SwitchToLocalInterrupts), if the signal handler runs between the
pg_atomic_fetch_or_u32, and changing MyPendingInterrupts. Maybe
something like this instead:

MyPendingInterrupts = &MyProc->pending_interrupts;
pg_memory_barrier();
pg_atomic_fetch_or_u32(&MyProc->pending_interrupts,
pg_atomic_read_u32(LocalPendingInterrupts));

Yeah, right, done.

And perhaps this should also clear LocalPendingInterrupts, just to be tidy.

I used atomic_exchange() to read and clear the bits in one go.

@@ -138,7 +139,8 @@
typedef struct ProcState
{
/* procPid is zero in an inactive ProcState array entry. */
-     pid_t           procPid;                /* PID of backend, for signaling */
+     pid_t           procPid;                /* pid of backend */
+     ProcNumber      pgprocno;               /* for sending interrupts */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
int                     nextMsgNum;             /* next message number to read */
bool            resetState;             /* backend needs to reset its state */

We can easily remove procPid altogether now that we have pgprocno here.

Since other things access those values, I propose to remove them in
separate patches.

Similarly with the pid/pgprocno in ReplicationSlot and WalSndState.

Same. Those pids show up in user interfaces, so I think they should
be handled in separate patches.

Note to self: I need to change some pgprocno to proc_number...

The next problems to remove are, I think, the various SIGUSR2, SIGINT,
SIGTERM signals sent by the postmaster. These should clearly become
SendInterrupt() or ProcSetLatch(). The problem here is that the
postmaster doesn't have the proc numbers yet. One idea is to teach
the postmaster to assign them! Not explored yet.

This version is passing on Windows. I'll create a CF entry. Still
work in progress!

Attachments:

v3-0001-Use-latch-for-bgworker-state-change-notification.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Use-latch-for-bgworker-state-change-notification.patchDownload+118-98
v3-0002-Redesign-interrupts-remove-ProcSignals.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Redesign-interrupts-remove-ProcSignals.patchDownload+586-840
#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#11)
Re: Interrupts vs signals

On 10/07/2024 09:48, Thomas Munro wrote:

On Mon, Jul 8, 2024 at 9:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

The patch actually does both: it still does kill(SIGUSR1) and also sets
the latch.

Yeah, I had some ideas about supporting old extension code that really
wanted a SIGUSR1, but on reflection, the only reason anyone ever wants
that is so that sigusr1_handler can SetLatch(), which pairs with
WaitLatch() in WaitForBackgroundWorker*(). Let's go all the way and
assume that.

+1

I think it would be nice if RegisterDynamicBackgroundWorker() had a
"bool notify_me" argument, instead of requiring the caller to set
"bgw_notify_pid = MyProcPid" before the call. That's a
backwards-compatibility break, but maybe we should bite the bullet and
do it. Or we could do this in RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything else than 0 or
MyProcPid.

Another idea: we could keep the bgw_notify_pid field around for a
while, documented as unused and due to be removed in future. We could
automatically capture the caller's proc number. So you get latch
wakeups by default, which I expect many people want, and most people
could cope with even if they don't want them. If you really really
don't want them, you could set a new flag BGW_NO_NOTIFY.

Ok. I was going to say that it feels excessive to change the default
like that. However, searching for RegisterDynamicBackgroundWorker() in
github, I can't actually find any callers that don't set bgw_notify_pid.
So yeah, makes sense.

I have now done this part of the change in a new first patch. This
particular use of kill(SIGUSR1) is separate from the ProcSignal
removal, it's just that it relies on ProcSignal's handler's default
action of calling SetLatch(). It's needed so the ProcSignal-ectomy
can fully delete sigusr1_handler(), but it's not directly the same
thing, so it seemed good to split the patch.

PostmasterMarkPIDForWorkerNotify() is now unused. Which means that
bgworker_notify is never set, and BackgroundWorkerStopNotifications() is
never called either.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it
can do any damage if you called it on a pointer to garbage, except if
the pointer itself is bogus, in which case just dereferencing it can
cause a segfault. So it would be nice to have a version specifically designed
with that in mind. For example, it could assume that the Latch's pid is
never legally equal to MyProcPid, because postmaster cannot own any latches.

Yeah I'm starting to think that all we need to do here is range-check
the proc number. Here's a version that adds
ProcSetLatch(proc_number). Another idea would be for SetLatch(latch)
to sanitise the address of a latch, ie its offset and range.

Hmm, I don't think postmaster should trust ProcGlobal->allProcCount either.

The next problems to remove are, I think, the various SIGUSR2, SIGINT,
SIGTERM signals sent by the postmaster. These should clearly become
SendInterrupt() or ProcSetLatch().

+1

The problem here is that the
postmaster doesn't have the proc numbers yet. One idea is to teach
the postmaster to assign them! Not explored yet.

I've been thinking that we should:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.

or

b) Use the pmsignal.c slot numbers for this, instead of ProcNumbers.
Postmaster already assigns those.

I'm kind of leaning towards b) for now, because it feels like a much
smaller patch. In the long run, it would be nice if every child process
had a ProcNumber, though. It was a nice simplification in v17 that we
don't have separate BackendId and ProcNumber anymore; similarly it would
be nice to not have separate PMChildSlot and ProcNumber concepts.

--
Heikki Linnakangas
Neon (https://neon.tech)

#13Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: Interrupts vs signals

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#13)
Re: Interrupts vs signals

On 29/07/2024 19:56, Robert Haas wrote:

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

Yes, there's a limit: roughly 2x max_connections. See
MaxLivePostmasterChildren().

There's another issue with that, which I was about to post in another
thread, but here goes: the MaxLivePostmasterChildren() limit is shared
by all regular backends, bgworkers and autovacuum workers. If you open a
lot of TCP connections to postmaster and don't send anything to the
server, you exhaust those slots, and the server won't be able to start
any autovacuum workers or background workers either. That's not great. I
started to work on approach b), with separate pools of slots for
different kinds of child processes, which fixes that. Stay tuned for a
patch.

In addition to that, you can have an unlimited number of "dead-end"
backends, which are doomed to just respond with "sorry, too many
clients" error. The only limit on those is the amount of resources
needed for all the processes and a little memory to track them.

--
Heikki Linnakangas
Neon (https://neon.tech)

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: Interrupts vs signals

Robert Haas <robertmhaas@gmail.com> writes:

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

The former. IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

regards, tom lane

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Interrupts vs signals

On Mon, Jul 29, 2024 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

The former. IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

Hmm. Not to sidetrack this thread too much, but multiplying by two
doesn't really sound like the right idea to me. The basic idea
articulated in the comment for canAcceptConnections() makes sense:
some backends might fail authentication, or might be about to exit, so
it makes sense to allow for some slop. But 2X is a lot of slop even on
a machine with the default max_connections=100, and people with
connection management problems are likely to be running with
max_connections=500 or max_connections=900 or even (insanely)
max_connections=2000. Continuing with a connection attempt because we
think that hundreds or thousands of connections that are ahead of us
in the queue might clear out of the way before we need a PGPROC is not
a good bet.

I wonder if we ought to restrict this to a small, flat value, like say
50, or by a new GUC that defaults to such a value if a constant seems
problematic. Maybe it doesn't really matter. I'm not sure how much
work we'd save by booting out the doomed connection attempt earlier.

The unlimited number of dead-end backends doesn't sound too great
either. I don't have another idea, but presumably resisting DDOS
attacks and/or preserving resources for things that still have a
chance of working ought to take priority over printing a nicer error
message from a connection that's doomed to fail anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#11)
Re: Interrupts vs signals

On 10/07/2024 09:48, Thomas Munro wrote:

The next problems to remove are, I think, the various SIGUSR2, SIGINT,
SIGTERM signals sent by the postmaster. These should clearly become
SendInterrupt() or ProcSetLatch(). The problem here is that the
postmaster doesn't have the proc numbers yet. One idea is to teach
the postmaster to assign them! Not explored yet.

With my latest patches on the "Refactoring postmaster's code to cleanup
after child exit" thread [1], every postmaster child process is assigned
a slot in the pmsignal.c array, including all the aux processes. If we
moved 'pending_interrupts' and the process Latch to the pmsignal.c
array, then you could send an interrupt also to a process that doesn't
have a PGPROC entry. That includes dead-end backends, backends that are
still in authentication, and the syslogger.

That would also make it so that the postmaster would never need to poke
into the procarray. pmsignal.c is already designated as the limited
piece of shared memory that is accessed by the postmaster
(BackgroundWorkerSlots is the other exception), so it would be kind of
nice if all the information that the postmaster needs to send an
interrupt was there. That would mean that where you currently use a
ProcNumber to identify a process, you'd use an index into the
PMSignalState array instead.

I don't insist on changing that right now, I think this patch is OK as
it is, but that might be a good next step later.

[1]: /messages/by-id/8f2118b9-79e3-4af7-b2c9-bd5818193ca4@iki.fi

I'm also wondering about the relationship between interrupts and
latches. Currently, SendInterrupt sets a latch to wake up the target
process. I wonder if it should be the other way 'round? Move all the
wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in
SetLatch, call SendInterrupt to wake up the target process? Somehow that
feels more natural to me, I think.

This version is passing on Windows. I'll create a CF entry. Still
work in progress!

Some comments on the patch details:

ereport(WARNING,
        (errmsg("NOTIFY queue is %.0f%% full", fillDegree * 100),
-        (minPid != InvalidPid ?
-         errdetail("The server process with PID %d is among those with the oldest transactions.", minPid)
+        (minPgprocno != INVALID_PROC_NUMBER ?
+         errdetail("The server process with pgprocno %d is among those with the oldest transactions.", minPgprocno)
          : 0),
-        (minPid != InvalidPid ?
+        (minPgprocno != INVALID_PROC_NUMBER ?
          errhint("The NOTIFY queue cannot be emptied until that process ends its current transaction.")
          : 0)));

This makes the message less useful to the user, as the ProcNumber isn't
exposed to users. With the PID, you can do "pg_terminate_backend(pid)".

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c42742d2c7b..bfb89049020 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -18,6 +18,7 @@

#include "foreign/fdwapi.h"
#include "miscadmin.h"
+#include "postmaster/interrupt.h"
#include "nodes/extensible.h"
#include "optimizer/appendinfo.h"
#include "optimizer/clauses.h"

misordered

+ * duplicated interrupts later if we switch backx.

typo: backx -> back

-	if (IdleInTransactionSessionTimeoutPending)
+	if (ConsumeInterrupt(INTERRUPT_IDLE_TRANSACTION_TIMEOUT))
{
/*
* If the GUC has been reset to zero, ignore the signal.  This is
@@ -3412,7 +3361,6 @@ ProcessInterrupts(void)
* interrupt.  We need to unset the flag before the injection point,
* otherwise we could loop in interrupts checking.
*/
-		IdleInTransactionSessionTimeoutPending = false;
if (IdleInTransactionSessionTimeout > 0)
{
INJECTION_POINT("idle-in-transaction-session-timeout");

The "We need to unset the flag.." comment is a bit out of place now,
since the flag was already cleared by ConsumeInterrupt(). Same in the
INTERRUPT_TRANSACTION_TIMEOUT and INTERRUPT_IDLE_SESSION_TIMEOUT
handling after this.

--
Heikki Linnakangas
Neon (https://neon.tech)

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#17)
Re: Interrupts vs signals

On 07/08/2024 17:59, Heikki Linnakangas wrote:

I'm also wondering about the relationship between interrupts and
latches. Currently, SendInterrupt sets a latch to wake up the target
process. I wonder if it should be the other way 'round? Move all the
wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in
SetLatch, call SendInterrupt to wake up the target process? Somehow that
feels more natural to me, I think.

I explored that a little, see attached patch set. It's going towards the
same end state as your patches, I think, but it starts from different
angle. In a nutshell:

Remove Latch as an abstraction, and replace all use of Latches with
Interrupts. When I originally created the Latch abstraction, I imagined
that we would have different latches for different purposes, but in
reality, almost all code just used the general-purpose "process latch".
This patch accepts that reality and replaces the Latch struct directly
with the interrupt mask in PGPROC.

This initially defines only two interrupts. INTERRUPT_GENERAL_WAKEUP is
the main one, sending that interrupt to a process replaces setting the
process's generic process latch in PGPROC:

* SetLatch(MyLatch) -> RaiseInterrupt(INTERRUPT_GENERAL_WAKEUP)

* SetLatch(&ProcGlobal->allProcs[procno].procLatch) ->
SendInterrupt(procno, INTERRUPT_GENERAL_WAKEUP)

* ResetLatch(MyLatch) -> ClearInterrupt(INTERRUPT_GENERAL_WAKEUP)

* WaitLatch(MyLatch) -> WaitInterrupt(1 << INTERRUPT_GENERAL_WAKEUP)

There was only one extra Latch in addition to the process's generic
procLatch, the recoveryWakeupLatch. It's replaced by the second
interrupt bit, INTERRUPT_RECOVERY_WAKEUP.

This is complementary or preliminary work to your patch set. All the
changes to replace ProcSignals with different interrupt bits could go on
top of this.

This patch set is work in progress, I'd love to hear your thoughts on
this before I spent more time on this. (Haven't tested on Windows for
example).

Patches 0001 - 0006 are just little cleanups and minor refactorings that
I think make sense even without the rest of the work, though.

0007 is the main patch.

Patch 0010 addresses the problem discussed at
/messages/by-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_=eA@mail.gmail.com.
Other solutions are discussed on that thread, but while working on
these, I realized that with these new interrupts, it's pretty
straightforward to fix by introducing one more interrupt reason.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Remove-unused-latch.patchtext/x-patch; charset=UTF-8; name=0001-Remove-unused-latch.patchDownload+0-11
0002-Remove-unneeded-include.patchtext/x-patch; charset=UTF-8; name=0002-Remove-unneeded-include.patchDownload+0-2
0003-Rename-SetWalSummarizerLatch-function.patchtext/x-patch; charset=UTF-8; name=0003-Rename-SetWalSummarizerLatch-function.patchDownload+4-5
0004-Address-walwriter-and-checkpointer-by-proc-number.patchtext/x-patch; charset=UTF-8; name=0004-Address-walwriter-and-checkpointer-by-proc-number.patchDownload+34-17
0005-Address-walreceiver-by-procno-not-direct-pointer-to-.patchtext/x-patch; charset=UTF-8; name=0005-Address-walreceiver-by-procno-not-direct-pointer-to-.patchDownload+29-27
0006-Use-proc-number-for-wakeups-in-waitlsn.c.patchtext/x-patch; charset=UTF-8; name=0006-Use-proc-number-for-wakeups-in-waitlsn.c.patchDownload+15-17
0007-Replace-Latches-with-Interrupts.patchtext/x-patch; charset=UTF-8; name=0007-Replace-Latches-with-Interrupts.patchDownload+919-886
0008-Replace-ProcSendSignal-ProcWaitForSignal-with-new-in.patchtext/x-patch; charset=UTF-8; name=0008-Replace-ProcSendSignal-ProcWaitForSignal-with-new-in.patchDownload+12-39
0009-Replace-all-remaining-Latch-calls-with-new-Interrupt.patchtext/x-patch; charset=UTF-8; name=0009-Replace-all-remaining-Latch-calls-with-new-Interrupt.patchDownload+335-379
0010-Fix-lost-wakeup-issue-in-logical-replication-launche.patchtext/x-patch; charset=UTF-8; name=0010-Fix-lost-wakeup-issue-in-logical-replication-launche.patchDownload+17-9
#19Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: Interrupts vs signals

On Sun, Aug 25, 2024 at 5:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/08/2024 17:59, Heikki Linnakangas wrote:

I'm also wondering about the relationship between interrupts and
latches. Currently, SendInterrupt sets a latch to wake up the target
process. I wonder if it should be the other way 'round? Move all the
wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in
SetLatch, call SendInterrupt to wake up the target process? Somehow that
feels more natural to me, I think.

I explored that a little, see attached patch set. It's going towards the
same end state as your patches, I think, but it starts from different
angle. In a nutshell:

Remove Latch as an abstraction, and replace all use of Latches with
Interrupts. When I originally created the Latch abstraction, I imagined
that we would have different latches for different purposes, but in
reality, almost all code just used the general-purpose "process latch".
This patch accepts that reality and replaces the Latch struct directly
with the interrupt mask in PGPROC.

Some very initial reactions:

* I like it!

* This direction seems to fit quite nicely with future ideas about
asynchronous network I/O. That may sound unrelated, but imagine that
a future version of WaitEventSet is built on Linux io_uring (or
Windows iorings, or Windows IOCP, or kqueue), and waits for the kernel
to tell you that network data has been transferred directly into a
user space buffer. You could wait for the interrupt word to change at
the same time by treating it as a futex[1]. Then all that other stuff
-- signalfd, is_set, maybe_sleeping -- just goes away, and all we have
left is one single word in memory. (That it is possible to do that is
not really a coincidence, as our own Mr Freund asked Mr Axboe to add
it[2]. The existing latch implementation techniques could be used as
fallbacks, but when looked at from the right angle, once you squish
all the wakeup reasons into a single word, it's all just an
implementation of a multiplexable futex with extra steps.)

* Speaking of other problems in other threads that might be solved by
this redesign, I think I see the outline of some solutions to the
problem of different classes of wakeup which you can handle at
different times, using masks. There is a tension in a few places
where we want to handle some kind of interrupts but not others in
localised wait points, which we sort of try to address by holding
interrupts or holding cancel interrupts, but it's not satisfying and
there are some places where it doesn't work well. Needs a lot more
thought, but a basic step would be: after old_interrupt_vector =
pg_atomic_fetch_or_u32(interrupt_vector, new_bits), if
(old_interrupt_vector & new_bits) == new_bits, then you didn't
actually change any bits, so you probably don't really need to wake
the other backend. If someone is currently unable to handle that type
of interrupt (has ignored, ie not cleared, those bits) or is already
in the process of handling it (is currently being rescheduled but
hasn't cleared those bits yet), then you don't bother to wake it up.
Concretely, it could mean that we avoid some of the useless wakeup
storm problems we see in vacuum delays or while executing a query and
not in a good place to handle sinval wakeups, etc. These are just
some raw thoughts, I am not sure about the bigger picture of that
topic yet.

* Archeological note on terminology: the reason almost every relational
database and all the literature uses the term "latch" for something
like our LWLocks seems to be that latches were/are one of the kinds of
system-provided mutex on IBM System/370 and its modern descendants,
i.e. z/OS. Oracle and other systems that started as knock-offs of the IBM
System R (the original SQL system, of which DB2 is the modern heir)
continued that terminology, even though they ran on VMS or Unix or
whatever. I would not be sad if we removed our unusual use of the
term latch.

[1]: https://man7.org/linux/man-pages/man3/io_uring_prep_futex_wait.3.html
[2]: https://lore.kernel.org/lkml/20230720221858.135240-1-axboe@kernel.dk/

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#19)
Re: Interrupts vs signals

On 26/08/2024 02:05, Thomas Munro wrote:

On Sun, Aug 25, 2024 at 5:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/08/2024 17:59, Heikki Linnakangas wrote:

I'm also wondering about the relationship between interrupts and
latches. Currently, SendInterrupt sets a latch to wake up the target
process. I wonder if it should be the other way 'round? Move all the
wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in
SetLatch, call SendInterrupt to wake up the target process? Somehow that
feels more natural to me, I think.

I explored that a little, see attached patch set. It's going towards the
same end state as your patches, I think, but it starts from different
angle. In a nutshell:

Remove Latch as an abstraction, and replace all use of Latches with
Interrupts. When I originally created the Latch abstraction, I imagined
that we would have different latches for different purposes, but in
reality, almost all code just used the general-purpose "process latch".
This patch accepts that reality and replaces the Latch struct directly
with the interrupt mask in PGPROC.

Some very initial reactions:

* I like it!

Ok, here's a new version, with a bunch of bugs and FIXMEs fixed,
comments and other polish. A few things remain that I'm not sure about:

- Terminology. We now have storage/interrupt.h and
postmaster/interrupt.h. They're not the same thing, but there's also
some overlap, especially after your patch set is applied.

Aside from filenames, "interrupt" now means at least two things: a
wakeup that you can send to a backend with SendInterrupt, and the thing
that you check with CHECK_FOR_INTERRUPTS(). They're related, but
different. To show case that confusion, the patch now contains this gem,
which was a result of mechanically replacing "latch" with "interrupt":

* We process interrupts whenever the interrupt has been set, so
* cancel/die interrupts are processed quickly.

- Fujii: this replaces the "recoveryWakeupLatch" with a separate
interrupt type. There was an earlier attempt at replacing
recoveryWakeupLatch with the process's regular latch which was reverted,
commits ac22929a26, and 00f690a239. I think this solution doesn't suffer
from the same problems as that earlier attempt, but if you have a chance
to review this, I would appreciate that. In a nutshell,
INTERRUPT_RECOVERY_CONTINUE interrupt is now used instead of
recoveryWakeupLatch, but the places that previously waited just on
recoveryWakeupLatch, now wait on both INTERRUPT_GENERAL_WAKEUP, which is
equivalent to waiting on the process latch, and
INTERRUPT_RECOVERY_CONTINUE. So those loops should now react more
quickly to signals like SIGHUP.

- Backwards-compatibility and extensions. This will break any extensions
that use Latches. Extensions that just used WaitLatch(MyLatch) or
similar are easy to convert to use interrupts instead. Or we could keep
around some backwards-compatibility macros like 0008 here does, to avoid
the code churn. However, if an extension is creating its own latches or
doing more complicated stuff with them, it gets harder to maintain
source-code compatibility for them.

Aside from the backwards-compatibility aspect, should we reserve a few
INTERRUPT_* values for extensions?

* This direction seems to fit quite nicely with future ideas about
asynchronous network I/O. That may sound unrelated, but imagine that
a future version of WaitEventSet is built on Linux io_uring (or
Windows iorings, or Windows IOCP, or kqueue), and waits for the kernel
to tell you that network data has been transferred directly into a
user space buffer. You could wait for the interrupt word to change at
the same time by treating it as a futex[1]. Then all that other stuff
-- signalfd, is_set, maybe_sleeping -- just goes away, and all we have
left is one single word in memory. (That it is possible to do that is
not really a coincidence, as our own Mr Freund asked Mr Axboe to add
it[2]. The existing latch implementation techniques could be used as
fallbacks, but when looked at from the right angle, once you squish
all the wakeup reasons into a single word, it's all just an
implementation of a multiplexable futex with extra steps.)

Cool

* Speaking of other problems in other threads that might be solved by
this redesign, I think I see the outline of some solutions to the
problem of different classes of wakeup that you can handle at
different times, using masks. There is a tension in a few places
where we want to handle some kinds of interrupts but not others at
localised wait points, which we sort of try to address by holding
interrupts or holding cancel interrupts, but it's not satisfying and
there are some places where it doesn't work well.
thought, but a basic step would be: after old_interrupt_vector =
pg_atomic_fetch_or_u32(interrupt_vector, new_bits), if
(old_interrupt_vector & new_bits) == new_bits, then you didn't
actually change any bits, so you probably don't really need to wake
the other backend. If someone is currently unable to handle that type
of interrupt (has ignored, ie not cleared, those bits) or is already
in the process of handling it (is currently being rescheduled but
hasn't cleared those bits yet), then you don't bother to wake it up.
Concretely, it could mean that we avoid some of the useless wakeup
storm problems we see in vacuum delays or while executing a query and
not in a good place to handle sinval wakeups, etc. These are just
some raw thoughts, I am not sure about the bigger picture of that
topic yet.

Yeah, I expect this work to help with those issues, but I'm also not
sure of the details yet.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v2-0001-Remove-unused-latch.patchtext/x-patch; charset=UTF-8; name=v2-0001-Remove-unused-latch.patchDownload+0-11
v2-0002-Remove-unneeded-include.patchtext/x-patch; charset=UTF-8; name=v2-0002-Remove-unneeded-include.patchDownload+0-2
v2-0003-Rename-SetWalSummarizerLatch-function.patchtext/x-patch; charset=UTF-8; name=v2-0003-Rename-SetWalSummarizerLatch-function.patchDownload+4-5
v2-0004-Address-walwriter-and-checkpointer-by-proc-number.patchtext/x-patch; charset=UTF-8; name=v2-0004-Address-walwriter-and-checkpointer-by-proc-number.patchDownload+34-17
v2-0005-Address-walreceiver-by-procno-not-direct-pointer-.patchtext/x-patch; charset=UTF-8; name=v2-0005-Address-walreceiver-by-procno-not-direct-pointer-.patchDownload+29-27
v2-0006-Use-proc-number-for-wakeups-in-waitlsn.c.patchtext/x-patch; charset=UTF-8; name=v2-0006-Use-proc-number-for-wakeups-in-waitlsn.c.patchDownload+16-17
v2-0007-Clean-up-existing-WaitLatch-calls-that-passed-MyL.patchtext/x-patch; charset=UTF-8; name=v2-0007-Clean-up-existing-WaitLatch-calls-that-passed-MyL.patchDownload+4-5
v2-0008-Replace-Latches-with-Interrupts.patchtext/x-patch; charset=UTF-8; name=v2-0008-Replace-Latches-with-Interrupts.patchDownload+980-895
v2-0009-Replace-ProcSendSignal-ProcWaitForSignal-with-new.patchtext/x-patch; charset=UTF-8; name=v2-0009-Replace-ProcSendSignal-ProcWaitForSignal-with-new.patchDownload+24-39
v2-0010-Replace-all-remaining-Latch-calls-with-new-Interr.patchtext/x-patch; charset=UTF-8; name=v2-0010-Replace-all-remaining-Latch-calls-with-new-Interr.patchDownload+333-391
v2-0011-Fix-lost-wakeup-issue-in-logical-replication-laun.patchtext/x-patch; charset=UTF-8; name=v2-0011-Fix-lost-wakeup-issue-in-logical-replication-laun.patchDownload+19-10
#21Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#20)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#24)
#26Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#25)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jelte Fennema-Nio (#26)
#28Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#27)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#25)
#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#29)
#31Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#29)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#31)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#31)
#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#32)
#35Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#33)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#36)
#38Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#37)
#39Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#38)
#40Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#39)
#41Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#38)
#42Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#41)
#43Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#42)
#44Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#38)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#44)
#46Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#45)
#47Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#44)
#48Joel Jacobson
joel@compiler.org
In reply to: Heikki Linnakangas (#45)
#49Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#47)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#50)
#52Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#51)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#52)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#53)
#55Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#54)
#56Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#54)