`pg_ctl init` crashes when run concurrently; semget(2) suspected
Hi,
Summary: semget(2) behaves differently on macOS and requires extra care.
I have many tests which spin up clusters using `pg_ctl init`, each in its
own single-use temporary directory. Each test is run for every PostgreSQL
installation found on the host machine. These tests are often run
concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
sporadic failures on macOS:
FATAL: could not create semaphores: Invalid argument
DETAIL: Failed system call was semget(176163502, 20, 03600).
child process exited with exit code 1
I think it's related to the increase of SEMAS_PER_SET
in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
semget(2) on macOS.
I think the bug manifests because:
- I create two clusters concurrently using `pg_ctl init`. One cluster is
PostgreSQL 17; the other is PostgreSQL 16 or earlier.
- Their data directories are separate but created close enough in time
to have sequential inodes. This is relevant because the inode is used to
seed the semaphore key.
- Somehow (waves hands) semget(2) in PostgreSQL 17 is called with a key
that points at a preexisting semaphore set. On Linux, due to the IPC_CREAT
| IPC_EXCL flags, this returns <0 and sets errno to EEXIST. On macOS, it
sets it instead to EINVAL, likely because the requested number of
semaphores is greater than those in the existing set. This is in the
InternalIpcSemaphoreCreate function, which then aborts the process.
The attached patch fixes the issue, I think, and includes another
description of this mechanism. On EINVAL it makes an additional call to
semget(2), but for zero semaphores.
The patch is relative to master, but I developed it against REL_17_5; it
should apply cleanly to both. I think it would be good to backport a fix to
17 too.
If anyone is feeling nerd-sniped, some better proof of the "waves hands"
bit would be useful, because that was a working hypothesis that led to a
working fix, and I have not yet had time to investigate further.
Please consider the patch for review. Thanks!
Gavin.
Attachments:
v1-0001-When-getting-EINVAL-from-semget-2-probe-for-EEXIS.patch (+35 -1)
Gavin Panella <gavinpanella@gmail.com> writes:
I have many tests which spin up clusters using `pg_ctl init`, each in its
own single-use temporary directory. Each test is run for every PostgreSQL
installation found on the host machine. These tests are often run
concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
sporadic failures on macOS:
FATAL: could not create semaphores: Invalid argument
DETAIL: Failed system call was semget(176163502, 20, 03600).
child process exited with exit code 1
Ugh.
I think it's related to the increase of SEMAS_PER_SET
in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
semget(2) on macOS.
Yeah, I guess you're right; but it's not macOS-specific AFAICS.
"man semget" quoth
[EINVAL] The number of semaphores requested is either less than
1 or greater than the system imposed maximum per set.
[EINVAL] A semaphore identifier exists for the argument key,
but the number of semaphores in the set associated
with it is less than nsems, and nsems is non-zero.
This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec. So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key. Nonetheless, key collisions must be expected, and we have
little right to assume that a pre-existing semaphore set must have
enough semaphores.
Using the same errno for these cases was really quite poorly chosen,
because in the first case we should not retry with another key, but
in the second case we should. If the problem is that SEMMSL is too
small, and we retry, it'll be an infinite loop.
I don't care for your patch because (a) I'm not sure that the result
of asking for zero semaphores is portable (the Linux man page suggests
that it might be allowed there); and (b) if the problem is SEMMSL,
your code will fall into the infinite loop.
My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so. This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.
If we do retry, the semget() in IpcSemaphoreCreate will fail again,
so we'll just skip over that semaphore key and try another. It
would be marginally nicer to release the semaphore set if it's
unused, but I'm not sure it'd be safe to assume that a set with
too few semaphores is one of ours. It certainly didn't come from
a prior cycle of life of the same postmaster.
regards, tom lane
I wrote:
This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec. So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key.
Hmm, no, there is a platform dependency here. I made the attached
test program to see what happens when there's a key collision,
and on Linux I get
semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: File exists
but macOS and NetBSD give
semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: Invalid argument
I didn't try other BSDen; this might be a NetBSD-ism that Apple
inherited, or maybe it's common among the BSDen. I don't see any
text in POSIX specifying which errno is to be returned in this case,
so we can't really argue that the behavior is wrong.
regards, tom lane
Attachments:
sematest.c
I wrote:
I didn't try other BSDen; this might be a NetBSD-ism that Apple
inherited, or maybe it's common among the BSDen.
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
For Linux and FreeBSD, it doesn't actually matter to us because
we no longer use SysV semaphores on those platforms. But this
is a live problem on macOS, NetBSD, OpenBSD.
regards, tom lane
I wrote:
My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so. This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.
Concretely, about like the attached. I don't have any clear idea
about what loop limit to use, but I think we do need one.
regards, tom lane
Attachments:
cope-with-EINVAL-from-semget.patch (+33 -13)
With that fix applied to REL_17_5 things are working well. Limiting the
search sounds like an improvement too.
As an experiment I added a log for when semget in
InternalIpcSemaphoreCreate returns -1. When I'm running `pg_ctl init` for
this local build concurrently with `pg_ctl init` from PostgreSQL 15 (or
another version prior to 17), I saw ~8 logged failures when there was
contention. As I increased the concurrency, the maximum number of logged
failures looked to be roughly 8 times the concurrency. For me, then,
running `pg_ctl init` with a concurrency of 125 would be needed, in the
worst case, to even begin exceeding the max retries of 1000. That sounds
high enough.
Then I thought: I'm only seeing the log from one of those instances, yet
they're all going through the same search for free semaphore sets. That's a
few system calls going to waste. Maybe not important in the big picture,
but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores,
i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few
guaranteed uncontested keys (at least, uncontested between themselves). In
a small test this eliminated contention for semaphore sets due to
concurrency. It is more of an optimisation though, rather than a bug fix.
Gavin
Gavin Panella <gavinpanella@gmail.com> writes:
With that fix applied to REL_17_5 things are working well. Limiting the
search sounds like an improvement too.
Cool. I'll push the fix after our release freeze lifts.
Then I thought: I'm only seeing the log from one of those instances, yet
they're all going through the same search for free semaphore sets. That's a
few system calls going to waste. Maybe not important in the big picture,
but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores,
i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few
guaranteed uncontested keys (at least, uncontested between themselves). In
a small test this eliminated contention for semaphore sets due to
concurrency. It is more of an optimisation though, rather than a bug fix.
Meh ... if we thought this was worth worrying about, I'd be more
inclined to run the st_ino through a hash function to spread out the
possible values a little more. But I think in nearly all scenarios
it'd be a waste of effort. If collisions were probable we'd have
found this issue years ago.
regards, tom lane
On Tue, 12 Aug 2025 at 01:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cool. I'll push the fix after our release freeze lifts.
Splendid, thank you!
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
Thomas Munro <thomas.munro@gmail.com> writes:
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
Hah, so that lends weight to the theory that this is an ancient
Berkeley mistake. I wonder how hard it would be to convince
POSIX that they should be more specific about which errno is
expected here.
In the meantime, though, we'll have to deal with the existing
behavior for years to come. So I'll go ahead with that patch.
I like having a loop limit there anyway --- I was never exactly
convinced that retrying indefinitely was a good idea.
regards, tom lane
On Wed, Aug 13, 2025 at 8:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In the meantime, though, we'll have to deal with the existing
behavior for years to come. So I'll go ahead with that patch.
I like having a loop limit there anyway --- I was never exactly
convinced that retrying indefinitely was a good idea.
LGTM. 1000 does seem like "enough" following the arguments made
earlier. Gavin does also raise a good point that it might be nice to
randomise the search a bit, but that sounds like another patch for
master.
It's not a very nice interface and I hope to get rid of it one day.
When I'm working on a Mac I frequently finish up with leaked SysV IPC
entries of both kinds and have to clean them up manually, though
admittedly while hacking and debugging PostgreSQL you exit in weird
and wonderful ways.
FWIW in early prototype multithreading patches you can just use
sem_init() on all these systems since you don't need pshared=1. It
works, but macOS spews "deprecated" warnings when you compile. Kinda
weird for the best semaphore API IMHO. But you can also just
implement LWLock on top of pthread_rwlock_t or futexes with some more
work, so I'm not bothered about that and there are more important
problems to solve in that project for now...
Here's a story from an Apple engineer about why they don't have pshared=1:
https://www.quora.com/Why-does-OS-X-not-support-unnamed-semaphores
I don't actually see any reason they couldn't adopt the same trick as
other modern systems, without an ABI break: use the *address* to
identify semaphores, converted to {VM object, offset} to form the key
in a hash table of waitlists that only has an entry while someone is
actually waiting. No expensive kernel resources while not being
waited on, no leaky SysV namespace management, ideal access control by
memory map. But you probably have to move to California if you want
to fix that.
They also have the CMU Mach semaphore API, semaphore_create(), which
some other projects seem to be using. I think it could be made to
work, but Mach ports are not inherited by child processes so you'd
have to send them explicitly after fork() which sounds painful and
slow, not to mention that the documentation for all that is thin on
the ground...
On Wed, Aug 13, 2025 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
FWIW in early prototype multithreading patches you can just use
sem_init() on all these systems since you don't need pshared=1.
Oops, I misremembered that. It works on NetBSD and OpenBSD, but not
macOS :-(.
Hmm, but it looks like there has been a new development in macOS 15.
They have finally made the futex API public:
https://developer.apple.com/documentation/os/os_sync_wait_on_address
That was one of the sticking points for my patch that added our own
replacement sem_init() implementation. At the time the only "proper"
way to reach the undocumented and undeclared
__ulock_wait()/__ulock_wake() functions was to write a module in C++
using std::atomic<int>::wait()/notify(), which I had started to wonder
about, but here it is with a shiny new C interface and documentation.
Maybe I should have another go at that!
On Wed, Aug 13, 2025 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Maybe I should have another go at that!
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
Attachments:
v1-0001-Use-futex-based-semaphores-on-macOS.patch (+298 -14)
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
I compared my hand-rolled atomics logic against FreeBSD's user space
semaphores[1]. sem_post() and sem_trywait() are about the same, and I
guess there just aren't many ways to write those. Minor differences:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
2. In sem_trywait() they reload the value while our
pg_atomic_compare_exchange_u32() does that for you, and I also prefer
to avoid assignment expression syntax if I can.
On the other hand my v1 sem_wait() seemed a little janky and weird in
comparison to theirs, and I was a bit worried about bugs. I thought
about rewriting it to look more like that, but then I realised that I
could use sem_trywait() to implement sem_wait() for the same net
effect. It's much shorter and sweeter and more intuitively
understandable, I think?
I also didn't bother to supply a couple of functions that posix_sema.c
doesn't use.
There's no immediately obvious performance difference, but I wasn't
expecting there to be: it's used in the slow path of our LWLocks and
they already have similar user space atomics in front, so you don't
really avoid entering the kernel with this technique except perhaps
just occasionally if you wait just as someone posts. This is all just
about the convenience of banishing System V cruft. It doesn't seem
to be any slower though, in simple testing, and CI might even be a bit
faster based on spot checks of a few runs... but proper statistical
tools might be needed to see if that is a real phenomenon.
Anyone got any better ideas for how to organise the build scripts,
control the feature, naming etc? I came up with this, which
automatically falls back to sysv if it can't find what it needs
(respecting Apple's deployment target thing following what we did for
preadv):
meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
The pg_futex.h abstraction layer doesn't have any support for other
OSes yet because I tried to pare this patch down to the absolute
minimum to solve this problem, but I did try at least a little bit to
anticipate that (having removed it from earlier versions from a few
years ago) when designing the futex contracts. We could trivially add
at least OpenBSD support to use emulated semaphores there. I never
quite figured out whether the NetBSD futex API is really available
outside its Linux syscall emulator, but that might be doable too. And
futexes might of course turn out to have applications in other
synchronisation primitives.
Any thoughts on whether this is worth pursuing, and what kinds of
validation might be useful?
[1]: https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
Attachments:
v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch (+289 -14)
On Aug 12, 2025, at 4:07 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
I believe that this is the code in macOS's "XNU" [1][2], which has similar
logic (unsurprisingly) to the FreeBSD code before the identified commit [3].
best.
-greg
[1]: https://github.com/apple-oss-distributions/xnu/blob/e3723e1f17661b24996789d8afc084c0c3303b26/bsd/kern/sysv_sem.c#L860-L889
[2]: "XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and a C++ API for writing drivers called IOKit."
[3]: https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
On Aug 13, 2025, at 8:01 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
Thomas, thanks for digging into this platform specific issue. I've read
your patch and done some research. Thoughts below...
I compared my hand-rolled atomics logic against FreeBSD's user space
semaphores[1]. sem_post() and sem_trywait() are about the same, and I
guess there just aren't many ways to write those. Minor differences:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
2. In sem_trywait() they reload the value while our
pg_atomic_compare_exchange_u32() does that for you, and I also prefer
to avoid assignment expression syntax if I can.
On the other hand my v1 sem_wait() seemed a little janky and weird in
comparison to theirs, and I was a bit worried about bugs. I thought
about rewriting it to look more like that, but then I realised that I
could use sem_trywait() to implement sem_wait() for the same net
effect. It's much shorter and sweeter and more intuitively
understandable, I think?
So if I understand it, the way your sem_wait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try again
This is essentially how sem_wait() implementations work on Linux, IIRC:
they use futexes or similar kernel primitives for efficient blocking.
I also didn't bother to supply a couple of functions that posix_sema.c
doesn't use.
I think that's fine for now.
There's no immediately obvious performance difference, but I wasn't
expecting there to be: it's used in the slow path of our LWLocks and
they already have similar user space atomics in front, so you don't
really avoid entering the kernel with this technique except perhaps
just occasionally if you wait just as someone posts. This is all just
about the convenience of banishing System V cruft. It doesn't seem
to be any slower though, in simple testing, and CI might even be a bit
faster based on spot checks of a few runs... but proper statistical
tools might be needed to see if that is a real phenomenon.
That lines up with my intuition after reading the patch. I don't know
that it is even worth diving too deep in performance testing this.
Anyone got any better ideas for how to organise the build scripts,
control the feature, naming etc? I came up with this, which
automatically falls back to sysv if it can't find what it needs
(respecting Apple's deployment target thing following what we did for
preadv):
meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
I'm not an expert on Meson, but I do have years of Autoconf work under
my belt. Offhand this seems reasonable, but I'm new to this project.
The pg_futex.h abstraction layer doesn't have any support for other
OSes yet because I tried to pare this patch down to the absolute
minimum to solve this problem, but I did try at least a little bit to
anticipate that (having removed it from earlier versions from a few
years ago) when designing the futex contracts. We could trivially add
at least OpenBSD support to use emulated semaphores there. I never
quite figured out whether the NetBSD futex API is really available
outside its Linux syscall emulator, but that might be doable too. And
futexes might of course turn out to have applications in other
synchronisation primitives.
Any thoughts on whether this is worth pursuing, and what kinds of
validation might be useful?
[1]: https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
<v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch>
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.
best.
-greg
[1]: https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
looking at one of the earlier versions. But I see now that POSIX 2024
has just added it! So the question is: do we care?
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess). I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits. So
perhaps it's not worth more than a comment?
I've found only very brief discussions in the standards community so far:
https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow
So if I understand it, the way your sem_wait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try again
This is essentially how sem_wait() implementations work on Linux, IIRC:
they use futexes or similar kernel primitives for efficient blocking.
Yeah. Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.
It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source. It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32) and maybe even do away with the struct and just use
pg_atomic_uint32 directly then you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses. But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
We'd be checking if a uint32 is > UINT32_MAX, assuming again that we'd
decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
you tried it.
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.
Thanks for looking! And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().
On Sat, Aug 16, 2025 at 2:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess).
And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c. I suppose it's not
expected to go very high at all? I was just trying to think about
where the bounds on that come from in theory, while working through
the case for ignoring EOVERFLOW...
Thomas Munro <thomas.munro@gmail.com> writes:
And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c. I suppose it's not
expected to go very high at all?
I wouldn't expect so.
regards, tom lane
On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
looking at one of the earlier versions. But I see now that POSIX 2024
has just added it! So the question is: do we care?
I'm on the fence about "caring"; if your code fixes the issue we could
simply add a comment about eventually needing to support EOVERFLOW. I
don't think what you have in the patch is wrong, and we're solving our
issue, not the world's inconsistencies.
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess). I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits. So
perhaps it's not worth more than a comment?
I've found only very brief discussions in the standards community so far:
https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow
I see what you mean, a comment that captures this decision and we're
done IMO.
So if I understand it, the way your sem_wait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try again
This is essentially how sem_wait() implementations work on Linux, IIRC:
they use futexes or similar kernel primitives for efficient blocking.
Yeah. Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.
Indeed, the "fast userspace mutex". *sigh* ;-P
I was calling this out because an implementation of sem_wait() built on
sem_trywait() that wasn't based on futexes would be less correct AFAICT.
It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source. It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32) and maybe even do away with the struct and just use
pg_atomic_uint32 directly then you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses. But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
We'd be checking if a uint32 is > UINT32_MAX, assuming again that we'd
decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
you tried it.
Oops, yep. You are right. :)
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.Thanks for looking! And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().
Sure thing, anytime. Not sure I helped much...
-greg