injection_points: Switch wait/wakeup to use atomics rather than latches
Hi all,
(Adding Andrey in CC, as I'm sure he is interested in that.)
While looking at the test proposed on the thread about the ProcKill(),
I have been reminded about the fact that relying on latches and a
condition variable for the wait and the wakeups has its limits:
/messages/by-id/aheVjCHmcbXBtiy0@paquier.xyz
In this case, we are trying to synchronize backends once they don't
have latch assigned anymore, which defeats the purpose of wait/wakeup
because the condition variable used in injection_points while waiting
expects a Latch to be set for the processes we are waiting on.
Folks have complained about this limitation a couple of times in the
past, and I never got around to do something about it. While looking
at that I have finished with the patch attached, which was
surprisingly simpler than what I thought was needed. This replaces
the condition variable with a set of atomic counters. The counters
are incremented at wakeup, and the wait checks them on a periodic
basis. The wait loop uses a delay that increases over time, maxed at
100ms so as we can get a good responsiveness on fast machines, without
burning CPU for nothing in tests that require more wait time due to a
tight loop with the counter checks.
One thing worth noticing is the CHECK_FOR_INTERRUPTS() in the wait
loop, which is something we need for the autovacuum test in test_misc
that requires some signaling and interrupt processing.
It may make sense to be conservative and limit ourselves to do this
change on HEAD, but I'd like to suggest a backpatch down to v17 so as
future tests that rely on a such change can be backpatched. I would
need this change for the other test, still consistency in the facility
primes for me here.
Note: The CI seems happy with the patch.
Thoughts or comments?
--
Michael
Attachments:
0001-injection_points-Switch-wait-wakeup-to-rely-on-atomi.patchtext/plain; charset=us-asciiDownload+27-27
On Wed, May 27, 2026 at 10:43 PM Michael Paquier <michael@paquier.xyz> wrote:
While looking at the test proposed on the thread about the ProcKill(),
I have been reminded about the fact that relying on latches and a
condition variable for the wait and the wakeups has its limits:
/messages/by-id/aheVjCHmcbXBtiy0@paquier.xyz
After reading this email, the linked-to email, and the commit message
for the patch, I still don't have a clear understanding of what this
is intended to fix. It seems like it's going to make the
responsiveness worse. In general, we want to replace escalating wait
loops with things that wake up instantly at the right time, and this
is going in the opposite direction.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 28, 2026 at 08:40:39AM -0400, Robert Haas wrote:
After reading this email, the linked-to email, and the commit message
for the patch, I still don't have a clear understanding of what this
is intended to fix. It seems like it's going to make the
responsiveness worse. In general, we want to replace escalating wait
loops with things that wake up instantly at the right time, and this
is going in the opposite direction.
This is an exchange between responsiveness of the system and
flexibility. I have had two complaints in the past about the fact
that the waits and wakeups were not doable due to the fact that we
rely on condition variables and latches:
- Postmaster context (lack of dsm access as one). Heikki has
mentioned that to me once as annoying when hacking on tests there at
protocol level, at least.
- Second case as shown on the previous thread, which was a tricky
scenario involving the termination of backends.
One limitation is also related to wait event visibility, which may not
be visible in pg_stat_activity. We could simply add a LOG entry in
injection_wait() once the old count is read, and rely on a server log
lookup in the TAP tests where we cannot use pg_stat_activity.
Compared to redesigning all the facilities that injection_points
relies on, this patch was striking me as having a good balance in
terms of responsiveness (min 10us, max 100ms) vs portability. The
minimum threshold does not really matter much in terms of runtime on
fast machines.
Does this explanation make sense?
--
Michael
On Thu, May 28, 2026 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, May 28, 2026 at 08:40:39AM -0400, Robert Haas wrote:
After reading this email, the linked-to email, and the commit message
for the patch, I still don't have a clear understanding of what this
is intended to fix. It seems like it's going to make the
responsiveness worse. In general, we want to replace escalating wait
loops with things that wake up instantly at the right time, and this
is going in the opposite direction.This is an exchange between responsiveness of the system and
flexibility. I have had two complaints in the past about the fact
that the waits and wakeups were not doable due to the fact that we
rely on condition variables and latches:
I'm still struggling to understand. Condition variables and latches
are both designed to allow for nice waits and wakeups.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 29/05/2026 15:48, Robert Haas wrote:
On Thu, May 28, 2026 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, May 28, 2026 at 08:40:39AM -0400, Robert Haas wrote:
After reading this email, the linked-to email, and the commit message
for the patch, I still don't have a clear understanding of what this
is intended to fix. It seems like it's going to make the
responsiveness worse. In general, we want to replace escalating wait
loops with things that wake up instantly at the right time, and this
is going in the opposite direction.This is an exchange between responsiveness of the system and
flexibility. I have had two complaints in the past about the fact
that the waits and wakeups were not doable due to the fact that we
rely on condition variables and latches:I'm still struggling to understand. Condition variables and latches
are both designed to allow for nice waits and wakeups.
They only work after you have a PGPROC slot. If you want to inject code
to authentication, or into postmaster, you cannot use them.
- Heikki
On Fri, May 29, 2026 at 9:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
They only work after you have a PGPROC slot. If you want to inject code
to authentication, or into postmaster, you cannot use them.
OK, got it now.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, May 29, 2026 at 12:00:46PM -0400, Robert Haas wrote:
On Fri, May 29, 2026 at 9:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
They only work after you have a PGPROC slot. If you want to inject code
to authentication, or into postmaster, you cannot use them.OK, got it now.
It seems like Heikki's comment was better worded than mine.
Also mentioned upthread, but the lack of PGPROC also means a lack of
monitoring as wait events cannot be tracked. Currently, we rely on
that in the TAP tests. For cases where the procs are not available, I
don't have a better idea than generating a LOG entry after the wait
counts have been generated (with a PID) and couple that with a poll of
the server logs to let a script understand that a process is in
waiting mode.
As a whole, I don't think that we should try to be fancy with the
implementation, which is why I have used primitives that should work
in any context, and I'm not convinced that this is worth its own
facility if that just means more responsiveness (in most cases the
wait should not take more than a couple ms to notice a wakeup). I'm
open to more fancy ideas, of course.
--
Michael
On 28 May 2026, at 07:43, Michael Paquier <michael@paquier.xyz> wrote:
Andrey in CC, as I'm sure he is interested in that.
Thanks! That's exactly what I need for my tests.
On 29 May 2026, at 18:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I'm still struggling to understand. Condition variables and latches
are both designed to allow for nice waits and wakeups.They only work after you have a PGPROC slot. If you want to inject code to authentication, or into postmaster, you cannot use them.
I have another reason: postmaster death behavior. When we wait on
ConVar and postmaster is kill-9-ed, we release all LWLocks. Which causes
corruption [0]/messages/by-id/B3C69B86-7F82-4111-B97F-0005497BB745@yandex-team.ru, because checkpointer can flush something that's not in WAL.
So I'm trying to build corruption-seeking tests using tool that can induce corruption
in tests.
About the patch:
- inj_state->wait_counts[index]++;
SpinLockRelease(&inj_state->lock);
- /* And broadcast the change to the waiters */
- ConditionVariableBroadcast(&inj_state->wait_point);
+ pg_atomic_fetch_add_u32(&inj_state->wait_counts[index], 1);
Can we move pg_atomic_fetch_add_u32() back under the lock?
We determine slot index under lock, then wakeup slot outside the lock.
In a correctly written test meaning this is not a problem.
However, technically, identity of a slot can change between releasing the lock
and incrementing wait_counts[index].
I'll do another pass tomorrow, maybe something else will catch my eye.
Best regards, Andrey Borodin.
[0]: /messages/by-id/B3C69B86-7F82-4111-B97F-0005497BB745@yandex-team.ru
On 30 May 2026, at 13:05, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I'll do another pass tomorrow, maybe something else will catch my eye.
I've tried the patch on my old corruption experiments. And it works for me.
I had to switch killing to something like
foreach my $i (1 .. 100)
{
my @alive = grep { kill 0, $_ } @cluster_pids;
last unless @alive;
kill 'KILL', @alive;
usleep(100_000);
}
The shared memory segment must be released before we attempt recovery.
But that's exactly what I wanted anyway. Thank you!
Best regards, Andrey Borodin.
On Mon, Jun 01, 2026 at 04:25:40PM +0500, Andrey Borodin wrote:
The shared memory segment must be released before we attempt recovery.
But that's exactly what I wanted anyway. Thank you!
How do you guarantee that a wait position is reached in the case where
you don't have access to wait events to make sure that the wait point
is reached?
--
Michael
On 2 Jun 2026, at 09:15, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jun 01, 2026 at 04:25:40PM +0500, Andrey Borodin wrote:
The shared memory segment must be released before we attempt recovery.
But that's exactly what I wanted anyway. Thank you!How do you guarantee that a wait position is reached in the case where
you don't have access to wait events to make sure that the wait point
is reached?
In a research test? sleep(1)
Best regards, Andrey Borodin.
On Tue, Jun 02, 2026 at 10:13:01AM +0500, Andrey Borodin wrote:
In a research test? sleep(1)
A hardcoded sleep can work on a fast machine but it makes the test
slow. A sleep is not a reliable technique if running the tests on a
slow machine, as an expected wait point may not have been reached. We
have both very slow and very fast animals in the buildfarm.
Rewording my question a bit: did you consider some options regarding
what an equivalent of a wait event lookup should look like when we
don't have a PGPROC? My idea of printing a LOG and do a server log
lookup would work, just asking if others have better ideas than the
only one I got.
--
Michael
On 2 Jun 2026, at 10:27, Michael Paquier <michael@paquier.xyz> wrote:
Rewording my question a bit: did you consider some options regarding
what an equivalent of a wait event lookup should look like when we
don't have a PGPROC? My idea of printing a LOG and do a server log
lookup would work, just asking if others have better ideas than the
only one I got.
For tests without PGPROC we can mmap() inj_state to a fixed file in
PGDATA/injection_points.shm. TAP can poll name[] to detect that a wait
point was reached and bump wait_counts[] to wake.
Best regards, Andrey Borodin.
On Tue, Jun 02, 2026 at 11:46:52PM +0500, Andrey Borodin wrote:
For tests without PGPROC we can mmap() inj_state to a fixed file in
PGDATA/injection_points.shm. TAP can poll name[] to detect that a wait
point was reached and bump wait_counts[] to wake.
That's a direction. Only mmap() would not be sufficient, as WIN32 has
its own non-POSIX idea on the matter with CreateFileMapping() &
friends. I am wondering if we should think harder about an interface
that could make such things easier for extensions. Or we could have a
portable layer added to injection_points, as well..
--
Michael
Hi Michael,
On 3 Jun 2026, at 03:23, Michael Paquier <michael@paquier.xyz> wrote:
That's a direction. Only mmap() would not be sufficient, as WIN32 has
its own non-POSIX idea on the matter with CreateFileMapping() &
friends.
Right, mmap() alone is not enough. I hacked up a prototype (on top of
your atomics commit) that maps the state portably on both sides: POSIX
mmap() in the backend and a file-backed CreateFileMapping() on WIN32,
and the same in a small standalone client. Perl cannot mmap()
portably, so the mapping is done by a tiny C helper that TAP drives,
rather than from the test script itself.
I am wondering if we should think harder about an interface that could
make such things easier for extensions. Or we could have a portable
layer added to injection_points, as well..
While prototyping I hit a constraint that I think answers this. To arm
a point without SQL it is not enough to reach this module's wait state:
you have to reach the registry of active points, i.e.
ActiveInjectionPoints in injection_point.c. INJECTION_POINT() consults
that array, and the module only supplies the callback once the core has
found the name. So the part that has to be reachable from outside lives
in the core, not in the module.
That splits the problem into two layers rather cleanly:
- core: the active-points array is backed by a file
(injection_points.shm in the data directory). This is the generic
piece - any extension could arm or inspect points out of band, and
it is what makes attach-without-SQL possible. The lock-free
generation protocol that reads the array is unchanged, so external
readers can rely on it too.
- module: injection_points keeps its own file
(injection_points_wait.shm) for the wait/wakeup coordination of the
"wait" action. That part is test-specific and stays out of the
core.
A standalone client (injection_points_state) maps both files the way
the backend does and can attach/detach a point and detect+release a
waiter, with no backend connection and no PGPROC. A TAP test runs the
whole flow without SQL for arming or waking: arm a wait point with the
client, trigger it from a background session, let the client poll the
mapped file until the point is reached (so no sleep, and it behaves the
same on slow and fast animals), disarm it, then wake. SQL is used only
to trigger the point and to observe injection_points_list().
This is a rough prototype to make the discussion concrete, not a
proposal of the final shape. To name few open points:
1. Where the portable mapping helper should live.
1. Keep the create-or-attach-file helper (POSIX + WIN32) local to
injection_point.c, as in the prototype. Simplest, no new public
surface.
2. Factor a small portable "named file-backed shared region" that any
extension could use - an "interface for extensions". More
general, but a larger commitment and more to review.
I have a slight preference for 1 now, growing into 2 only once a
second user actually appears.
2. One file or two. The prototype keeps the core registry and the
module's wait coordination in two files. Folding a wait counter into
the core InjectionPointEntry would make it a single file, but it
pushes test-only wait semantics into the core struct, which I would
rather avoid. Slight preference for two files.
3. The client writes the registry locklessly (no InjectionPointLock), so
it assumes nothing else attaches/detaches the same array
concurrently. That holds for arming points out of band around a
controlled session, but it is not a general concurrent-safe writer.
I think that is acceptable for tests, but I am not sure.
4. The path defaults to injection_points.shm under the data directory,
with an env override (PG_INJECTION_POINTS_FILE) so a point could be
armed before the data dir exists or in single player mode. I have
not added a test for that, so probably it does not work. Yet.
5. The external mirror reads and writes the pg_atomic_uint{32,64} fields
as plain integers. That only holds where those atomics are a bare
value; on a platform without native 64-bit atomics pg_atomic_uint64
falls back to a spinlock-protected struct with a different layout, so
the byte mirror would not match (and the backend's width assertion
would fail to build there in the first place). Even where they are
native, the 64-bit field forces 8-byte alignment that the mirror must
replicate - I got bitten by exactly that on the 32-bit build, where
the entries array otherwise starts 4 bytes too early. So a robust
portable layer, if we want one, is not as trivial as it first looked
to me.
6. Whether the external interface should do more than "wait". The
file-backed registry already lets an outside process attach a point
to any callback (error, notice, ...); the only thing we can observe
back without SQL today is that a wait was reached. A natural
generalization is a per-point hit counter that any process bumps when
it runs the point, readable from outside. That would let a test
assert "initdb really went through this path", or "the postmaster
reached this point before it failed at startup" - cases where there
is no backend to query. I have not built it, and it leans the same
way as 2. (it would put an observation counter into the core entry),
but it seems like the obvious next use-case if we go this route.
WDYT? I am not at all sure the file-backed registry is the right
long-term shape; if you prefer the elog-and-grep route I am happy to drop
this, it was mostly a way to see how invasive the alternative really is.
Thank you!
Best regards, Andrey Borodin.
P.S. Fun observation - it speeds up tests. I swapped the "point reached"
detection in test_misc/010_index_concurrently_upsert.pl (41 cases, each polling
pg_stat_activity every 100ms via a fresh psql) for a 10ms poll of the mapped
file through the client. Median test wall time dropped from 2.2s to 1.6s
(~27%) on my MacBook Air M5.