`pg_ctl init` crashes when run concurrently; semget(2) suspected
Hi,
Summary: semget(2) behaves differently on macOS and requires extra care.
I have many tests which spin up clusters using `pg_ctl init`, each in its
own single-use temporary directory. Each test is run for every PostgreSQL
installation found on the host machine. These tests are often run
concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
sporadic failures on macOS:
FATAL: could not create semaphores: Invalid argument
DETAIL: Failed system call was semget(176163502, 20, 03600).
child process exited with exit code 1
I think it's related to the increase of SEMAS_PER_SET
in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
semget(2) on macOS.
I think the bug manifests because:
- I create two clusters concurrently using `pg_ctl init`. One cluster is
PostgreSQL 17; the other is PostgreSQL 16 or earlier.
- Their data directories are separate but created close enough in time
to have sequential inodes. This is relevant because the inode is used to
seed the semaphore key.
- Somehow (waves hands) semget(2) in PostgreSQL 17 is called with a key
that points at a preexisting semaphore set. On Linux, due to the IPC_CREAT
| IPC_EXCL flags, this returns <0 and sets errno to EEXIST. On macOS, it
sets it instead to EINVAL, likely because the requested number of
semaphores is greater than those in the existing set. This is in the
InternalIpcSemaphoreCreate function, which then aborts the process.
The attached patch fixes the issue, I think, and has another description of
this mechanism. On EINVAL it adds an additional call to semget(2) but for
zero semaphores.
The patch is relative to master, but I developed it against REL_17_5; it
should apply cleanly to both. I think it would be good to backport a fix to
17 too.
If anyone is feeling nerd-sniped, some better proof of the "waves hands"
bit would be useful, because that was a working hypothesis that led to a
working fix, and I have not yet had time to investigate further.
Please consider the patch for review. Thanks!
Gavin.
Attachments:
v1-0001-When-getting-EINVAL-from-semget-2-probe-for-EEXIS.patchapplication/octet-stream; name=v1-0001-When-getting-EINVAL-from-semget-2-probe-for-EEXIS.patchDownload
From c91fcee144f86be2881eb48f215d26fcfa459c2d Mon Sep 17 00:00:00 2001
From: Gavin Panella <gavinpanella@gmail.com>
Date: Sun, 10 Aug 2025 19:30:19 +0200
Subject: [PATCH v1] When getting EINVAL from semget(2), probe for EEXIST
---
src/backend/port/sysv_sema.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..51308a651cf 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -113,6 +113,41 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
)
return -1;
+ /*
+ * Typically EINVAL means that the number of semaphores requested is
+ * greater than the system imposed maximum per set, or that the set
+ * exists but that it has fewer than numSems semaphores in it. In the
+ * latter case, presumably the set was allocated by another program or
+ * perhaps by another version of PostgreSQL (SEMAS_PER_SET was raised
+ * in version 17, for example).
+ *
+ * On many platforms, including Linux, since the flags passed to
+ * semget(2) includes IPC_CREAT | IPC_EXCL, it would return <0 and set
+ * errno to EEXIST when numSems is greater than the number of
+ * semaphores in a preexisting set.
+ *
+ * On some platforms, e.g. macOS 15.6, when encountering an existing
+ * set, it appears that numSems is checked first, before considering
+ * flags. This means that when numSems is greater than the number of
+ * semaphores in a preexisting set, semget(2) returns <0 and errno is
+ * set to EINVAL. Hence, on these platforms, EINVAL can mean we've hit
+ * a system limit and thus we cannot continue, OR it can mean we've
+ * found an existing set, a benign and anticipated situation; see how
+ * EEXIST is handled above.
+ *
+ * Fortunately we can distinguish these by calling semget(2) again,
+ * asking for _zero_ semaphores. There is a potential race, e.g. the
+ * semaphore set could be removed, but if we take the stance that we
+ * will not use this set in any case, we need only check that errno is
+ * _not_ EINVAL this time.
+ */
+ if (saved_errno == EINVAL)
+ {
+ semId = semget(semKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection);
+ if (semId >= 0 || errno != EINVAL)
+ return -1;
+ }
+
/*
* Else complain and abort
*/
--
2.50.1
Gavin Panella <gavinpanella@gmail.com> writes:
I have many tests which spin up clusters using `pg_ctl init`, each in its
own single-use temporary directory. Each test is run for every PostgreSQL
installation found on the host machine. These tests are often run
concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
sporadic failures on macOS:
FATAL: could not create semaphores: Invalid argument
DETAIL: Failed system call was semget(176163502, 20, 03600).
child process exited with exit code 1
Ugh.
I think it's related to the increase of SEMAS_PER_SET
in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
semget(2) on macOS.
Yeah, I guess you're right; but it's not macOS-specific AFAICS.
"man semget" quoth
[EINVAL] The number of semaphores requested is either less than
1 or greater than the system imposed maximum per set.
[EINVAL] A semaphore identifier exists for the argument key,
but the number of semaphores in the set associated
with it is less than nsems, and nsems is non-zero.
This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec. So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key. Nonetheless, key collisions must be expected, and we have
little right to assume that a pre-existing semaphore set must have
enough semaphores.
Using the same errno for these cases was really quite poorly chosen,
because in the first case we should not retry with another key, but
in the second case we should. If the problem is that SEMMSL is too
small, and we retry, it'll be an infinite loop.
I don't care for your patch because (a) I'm not sure that the result
of asking for zero semaphores is portable (the Linux man page suggests
that it might be allowed there); and (b) if the problem is SEMMSL,
your code will fall into the infinite loop.
My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so. This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.
If we do retry, the semget() in IpcSemaphoreCreate will fail again,
so we'll just skip over that semaphore key and try another. It
would be marginally nicer to release the semaphore set if it's
unused, but I'm not sure it'd be safe to assume that a set with
too few semaphores is one of ours. It certainly didn't come from
a prior cycle of life of the same postmaster.
regards, tom lane
I wrote:
This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec. So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key.
Hmm, no, there is a platform dependency here. I made the attached
test program to see what happens when there's a key collision,
and on Linux I get
semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: File exists
but macOS and NetBSD give
semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: Invalid argument
I didn't try other BSDen; this might be a NetBSD-ism that Apple
inherited, or maybe it's common among the BSDen. I don't see any
text in POSIX specifying which errno is to be returned in this case,
so we can't really argue that the behavior is wrong.
regards, tom lane
Attachments:
I wrote:
I didn't try other BSDen; this might be a NetBSD-ism that Apple
inherited, or maybe it's common among the BSDen.
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
For Linux and FreeBSD, it doesn't actually matter to us because
we no longer use SysV semaphores on those platforms. But this
is a live problem on macOS, NetBSD, OpenBSD.
regards, tom lane
I wrote:
My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so. This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.
Concretely, about like the attached. I don't have any clear idea
about what loop limit to use, but I think we do need one.
regards, tom lane
Attachments:
cope-with-EINVAL-from-semget.patchtext/x-diff; charset=us-ascii; name=cope-with-EINVAL-from-semget.patchDownload
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..8984a0aa2d1 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -69,7 +69,7 @@ static int nextSemaNumber; /* next free sem num in last sema set */
static IpcSemaphoreId InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey,
- int numSems);
+ int numSems, bool retry_ok);
static void IpcSemaphoreInitialize(IpcSemaphoreId semId, int semNum,
int value);
static void IpcSemaphoreKill(IpcSemaphoreId semId);
@@ -88,9 +88,13 @@ static void ReleaseSemaphores(int status, Datum arg);
* If we fail with a failure code other than collision-with-existing-set,
* print out an error and abort. Other types of errors suggest nonrecoverable
* problems.
+ *
+ * Unfortunately, it's sometimes hard to tell whether errors are
+ * nonrecoverable. Our caller keeps track of whether continuing to retry
+ * is sane or not; if not, we abort on failure regardless of the errno.
*/
static IpcSemaphoreId
-InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
+InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems, bool retry_ok)
{
int semId;
@@ -101,16 +105,27 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
int saved_errno = errno;
/*
- * Fail quietly if error indicates a collision with existing set. One
- * would expect EEXIST, given that we said IPC_EXCL, but perhaps we
- * could get a permission violation instead? Also, EIDRM might occur
- * if an old set is slated for destruction but not gone yet.
+ * Fail quietly if error suggests a collision with an existing set and
+ * our caller has not lost patience.
+ *
+ * One would expect EEXIST, given that we said IPC_EXCL, but perhaps
+ * we could get a permission violation instead. On some platforms
+ * EINVAL will be reported if the existing set has too few semaphores.
+ * Also, EIDRM might occur if an old set is slated for destruction but
+ * not gone yet.
+ *
+ * EINVAL is the key reason why we need the caller-level loop limit,
+ * as it can also mean that the platform's SEMMSL is less than
+ * numSems, and that condition will not go away.
*/
- if (saved_errno == EEXIST || saved_errno == EACCES
+ if (retry_ok &&
+ (saved_errno == EEXIST
+ || saved_errno == EACCES
+ || saved_errno == EINVAL
#ifdef EIDRM
- || saved_errno == EIDRM
+ || saved_errno == EIDRM
#endif
- )
+ ))
return -1;
/*
@@ -207,17 +222,22 @@ IpcSemaphoreGetLastPID(IpcSemaphoreId semId, int semNum)
static IpcSemaphoreId
IpcSemaphoreCreate(int numSems)
{
+ int num_tries = 0;
IpcSemaphoreId semId;
union semun semun;
PGSemaphoreData mysema;
/* Loop till we find a free IPC key */
- for (nextSemaKey++;; nextSemaKey++)
+ for (nextSemaKey++;; nextSemaKey++, num_tries++)
{
pid_t creatorPID;
- /* Try to create new semaphore set */
- semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+ /*
+ * Try to create new semaphore set. Give up after trying 1000
+ * distinct IPC keys.
+ */
+ semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1,
+ num_tries < 1000);
if (semId >= 0)
break; /* successful create */
@@ -254,7 +274,7 @@ IpcSemaphoreCreate(int numSems)
/*
* Now try again to create the sema set.
*/
- semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+ semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1, true);
if (semId >= 0)
break; /* successful create */
With that fix applied to REL_17_5 things are working well. Limiting the
search sounds like an improvement too.
As an experiment I added a log for when semget in
InternalIpcSemaphoreCreate returns -1. When I'm running `pg_ctl init` for
this local build concurrently with `pg_ctl init` from PostgreSQL 15 (or
another version prior to 17), I saw ~8 logged failures when there was
contention. As I increased the concurrency, the maximum number of logged
failures looked to be ~8 times concurrency, roughly. For me, then, running
`pg_ctl init` with a concurrency of 125 would be needed to even begin
exceeding the max retries of 1000 – in the worst case. That sounds high
enough.
Then I thought: I'm only seeing the log from one of those instances, yet
they're all going through the same search for free semaphore sets. That's a
few system calls going to waste. Maybe not important in the big picture,
but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores,
i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few
guaranteed uncontested keys (at least, uncontested between themselves). In
a small test this eliminated contention for semaphore sets due to
concurrency. It is more of an optimisation though, rather than a bug fix.
Gavin
Gavin Panella <gavinpanella@gmail.com> writes:
With that fix applied to REL_17_5 things are working well. Limiting the
search sounds like an improvement too.
Cool. I'll push the fix after our release freeze lifts.
Then I thought: I'm only seeing the log from one of those instances, yet
they're all going through the same search for free semaphore sets. That's a
few system calls going to waste. Maybe not important in the big picture,
but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores,
i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few
guaranteed uncontested keys (at least, uncontested between themselves). In
a small test this eliminated contention for semaphore sets due to
concurrency. It is more of an optimisation though, rather than a bug fix.
Meh ... if we thought this was worth worrying about, I'd be more
inclined to run the st_ino through a hash function to spread out the
possible values a little more. But I think in nearly all scenarios
it'd be a waste of effort. If collisions were probable we'd have
found this issue years ago.
regards, tom lane
On Tue, 12 Aug 2025 at 01:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cool. I'll push the fix after our release freeze lifts.
Splendid, thank you!
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
Thomas Munro <thomas.munro@gmail.com> writes:
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.
Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
Hah, so that lends weight to the theory that this is an ancient
Berkeley mistake. I wonder how hard it would be to convince
POSIX that they should be more specific about which errno is
expected here.
In the meantime, though, we'll have to deal with the existing
behavior for years to come. So I'll go ahead with that patch.
I like having a loop limit there anyway --- I was never exactly
convinced that retrying indefinitely was a good idea.
regards, tom lane
On Wed, Aug 13, 2025 at 8:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In the meantime, though, we'll have to deal with the existing
behavior for years to come. So I'll go ahead with that patch.
I like having a loop limit there anyway --- I was never exactly
convinced that retrying indefinitely was a good idea.
LGTM. 1000 does seem like "enough" following the arguments made
earlier. Gavin does also raise a good point that it might be nice to
randomise the search a bit but that sounds like another patch for
master.
It's not a very nice interface and I hope to get rid of it one day.
When I'm working on a Mac I frequently finish up with leaked SysV IPC
entries of both kinds and have to clean them up manually, though
admittedly while hacking and debugging PostgreSQL you exit in weird
and wonderful ways.
FWIW in early prototype multithreading patches you can just use
sem_init() on all these systems since you don't need pshared=1. It
works, but macOS spews "deprecated" warnings when you compile. Kinda
weird for the best semaphore API IMHO. But you can also just
implement LWLock on top of pthread_rwlock_t or futexes with some more
work, so I'm not bothered about that and there are more important
problems to solve in that project for now...
Here's a story from an Apple engineer about why they don't have pshared=1:
https://www.quora.com/Why-does-OS-X-not-support-unnamed-semaphores
I don't actually see any reason they couldn't adopt the same trick as
other modern systems, without an ABI break: use the *address* to
identify semaphores, converted to {VM object, offset} to form the key
in a hash table of waitlists that only has an entry while someone is
actually waiting. No expensive kernel resources while not being
waited on, no leaky SysV namespace management, ideal access control by
memory map. But you probably have to move to California if you want
to fix that.
They also have the CMU Mach semaphore API, semaphore_create(), which
some other projects seem to be using. I think it could be made to
work, but Mach ports are not inherited by child processes so you'd
have to send them explicitly after fork() which sounds painful and
slow, not to mention that the documentation for all that is thin on
the ground...
On Wed, Aug 13, 2025 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
FWIW in early prototype multithreading patches you can just use
sem_init() on all these systems since you don't need pshared=1.
Oops, I misremembered that. It works on NetBSD and OpenBSD, but not
macOS :-(.
Hmm, but it looks like there has been a new development in macOS 15.
They have finally made the futex API public:
https://developer.apple.com/documentation/os/os_sync_wait_on_address
That was one of the sticking points for my patch that added our own
replacement sem_init() implementation. At the time the only "proper"
way to reach the undocumented and undeclared
__ulock_wait()/__ulock_wake() functions was to write a module in C++
using std::atomic<int>::wait()/notify(), which I had started to wonder
about, but here it is with a shiny new C interface and documentation.
Maybe I should have another go at that!
On Wed, Aug 13, 2025 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Maybe I should have another go at that!
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
Attachments:
v1-0001-Use-futex-based-semaphores-on-macOS.patchapplication/octet-stream; name=v1-0001-Use-futex-based-semaphores-on-macOS.patchDownload
From d823cbde9b90f30f4133538161ef289fc7d97a6e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH v1] Use futex-based semaphores on macOS.
Provide an implementation of the POSIX unnamed semaphore API when
requested at build time with the following options:
meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
With both build systems, it falls back to sysv if native futexes are not
detected, and only macOS 14.4's new os_sync_wait_on_address() is
supported initially.
Futexes might have more general applications and on more platforms in
later work.
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKAPR%3DNXxMTCmEomCUpq%2BmrPLDVyfVa4bKZqfDjwyzD7Q%40mail.gmail.com
---
configure | 24 +++++++
configure.ac | 8 +++
meson.build | 14 ++++
src/backend/port/Makefile | 3 +-
src/backend/port/meson.build | 3 +
src/backend/port/posix_sema.c | 14 ++++
src/backend/port/sem_init.c | 101 ++++++++++++++++++++++++++
src/include/pg_config.h.in | 7 ++
src/include/port/pg_futex.h | 120 +++++++++++++++++++++++++++++++
src/template/darwin | 15 +---
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 298 insertions(+), 13 deletions(-)
create mode 100644 src/backend/port/sem_init.c
create mode 100644 src/include/port/pg_futex.h
diff --git a/configure b/configure
index 507a2437c33..aba6f39eb38 100755
--- a/configure
+++ b/configure
@@ -16179,6 +16179,21 @@ _ACEOF
# We can't use AC_CHECK_FUNCS to detect these functions, because it
# won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "os_sync_wait_on_address" "ac_cv_have_decl_os_sync_wait_on_address" "#include <os/os_sync_wait_on_address.h>
+"
+if test "x$ac_cv_have_decl_os_sync_wait_on_address" = xyes; then :
+ ac_have_decl=1
+else
+ ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS $ac_have_decl
+_ACEOF
+if test $ac_have_decl = 1; then :
+ HAVE_NATIVE_FUTEX=1
+fi
+
ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
"
if test "x$ac_cv_have_decl_preadv" = xyes; then :
@@ -18419,6 +18434,15 @@ if test "$ac_res" != no; then :
fi
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"UNNAMED_POSIX+EMULATION" ; then
+ # Need futex support for this
+ if test x"$HAVE_NATIVE_FUTEX" = x"1" ; then
+
+$as_echo "#define USE_SEMAPHORE_EMULATION 1" >>confdefs.h
+
+ USE_UNNAMED_POSIX_SEMAPHORES=1
+ fi
+ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking which semaphore API to use" >&5
$as_echo_n "checking which semaphore API to use... " >&6; }
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
diff --git a/configure.ac b/configure.ac
index 5f4548adc5c..71f7c77b381 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1841,6 +1841,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen, strsep, timingsafe_bcmp])
# We can't use AC_CHECK_FUNCS to detect these functions, because it
# won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([os_sync_wait_on_address], [HAVE_NATIVE_FUTEX=1], [], [#include <os/os_sync_wait_on_address.h>])
AC_CHECK_DECLS([preadv], [], [], [#include <sys/uio.h>])
AC_CHECK_DECLS([pwritev], [], [], [#include <sys/uio.h>])
AC_CHECK_DECLS([strchrnul], [], [], [#include <string.h>])
@@ -2272,6 +2273,13 @@ if test "$PORTNAME" != "win32"; then
# Need sem_init for this
AC_SEARCH_LIBS(sem_init, [rt pthread], [USE_UNNAMED_POSIX_SEMAPHORES=1])
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"UNNAMED_POSIX+EMULATION" ; then
+ # Need futex support for this
+ if test x"$HAVE_NATIVE_FUTEX" = x"1" ; then
+ AC_DEFINE(USE_SEMAPHORE_EMULATION, 1, [Define to enable semaphore emulation with futexes.])
+ USE_UNNAMED_POSIX_SEMAPHORES=1
+ fi
+ fi
AC_MSG_CHECKING([which semaphore API to use])
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
AC_DEFINE(USE_NAMED_POSIX_SEMAPHORES, 1, [Define to select named POSIX semaphores.])
diff --git a/meson.build b/meson.build
index 30e0edda3e7..3c360bb5d02 100644
--- a/meson.build
+++ b/meson.build
@@ -207,6 +207,7 @@ if host_system == 'cygwin'
mod_link_with_dir = 'libdir'
elif host_system == 'darwin'
+ sema_kind = 'unnamed_posix+emulation' # reverts to sysv if unsupported
dlsuffix = '.dylib'
library_path_var = 'DYLD_LIBRARY_PATH'
@@ -2660,6 +2661,7 @@ decl_checks = [
# checking for library symbols wouldn't handle deployment target
# restrictions on macOS
decl_checks += [
+ ['os_sync_wait_on_address', 'os/os_sync_wait_on_address.h'],
['preadv', 'sys/uio.h'],
['pwritev', 'sys/uio.h'],
['strchrnul', 'string.h'],
@@ -2968,6 +2970,18 @@ if sema_kind == 'unnamed_posix' and \
sema_kind = 'sysv'
endif
+# for unnamed_posix+emulation, use unnamed_posix and enable emulation, or fall
+# back to sysv if support isn't found
+if sema_kind == 'unnamed_posix+emulation'
+ if cdata.get('HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS') != 0
+ cdata.set('USE_SEMAPHORE_EMULATION', 1,
+ description: 'Define to 1 to enable emulation of unnamed POSIX semaphores')
+ sema_kind = 'unnamed_posix'
+ else
+ sema_kind = 'sysv'
+ endif
+endif
+
cdata.set('USE_@0@_SHARED_MEMORY'.format(shmem_kind.to_upper()), 1)
cdata.set('USE_@0@_SEMAPHORES'.format(sema_kind.to_upper()), 1)
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index 47338d99229..639049e6be7 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -25,7 +25,8 @@ OBJS = \
$(TAS) \
atomics.o \
pg_sema.o \
- pg_shmem.o
+ pg_shmem.o \
+ sem_init.o
ifeq ($(PORTNAME), win32)
SUBDIRS += win32
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 09d54e01d13..a130549ac9a 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -4,6 +4,9 @@ backend_sources += files(
'atomics.c',
)
+if cdata.has('USE_SEMAPHORE_EMULATION')
+ backend_sources += files('sem_init.c')
+endif
if cdata.has('USE_UNNAMED_POSIX_SEMAPHORES') or cdata.has('USE_NAMED_POSIX_SEMAPHORES')
backend_sources += files('posix_sema.c')
diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..707b0615559 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -36,6 +36,20 @@
#include "storage/pg_sema.h"
#include "storage/shmem.h"
+#ifdef USE_SEMAPHORE_EMULATION
+#include "port/pg_semaphore.h"
+
+/*
+ * Redirect to futex-based emulation of the POSIX unnamed API, for systems that
+ * can't do sem_init() with pshared=1.
+ */
+#define sem_t pg_sem_t
+#define sem_init pg_sem_init
+#define sem_destroy pg_sem_destroy
+#define sem_post pg_sem_post
+#define sem_wait pg_sem_wait
+#define sem_trywait pg_sem_trywait
+#endif
/* see file header comment */
#if defined(USE_NAMED_POSIX_SEMAPHORES) && defined(EXEC_BACKEND)
diff --git a/src/backend/port/sem_init.c b/src/backend/port/sem_init.c
new file mode 100644
index 00000000000..dd275c0c052
--- /dev/null
+++ b/src/backend/port/sem_init.c
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * sem_init.c
+ *
+ * Drop-in replacement for POSIX sem_init(), sem_post(), sem_wait() and
+ * sem_destroy(). These can be used on systems that don't provide unnamed
+ * semaphores with pshared=1, but do have shared memory futexes that we can use
+ * to build our own.
+ *
+ * pg_sem_t is a typedef for pg_futex_u32 in src/include/port/pg_semaphore.h.
+ * If we wanted to add a waiter count it could become a struct.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/port/sem_init.c
+ *
+ *-------------------------------------------------------------------------/
+ */
+
+#include "postgres.h"
+
+#ifdef USE_SEMAPHORE_EMULATION
+
+#include "port/atomics.h"
+#include "port/pg_futex.h"
+#include "port/pg_semaphore.h"
+
+/*
+ * Initialize a semaphore with a given value.
+ */
+int
+pg_sem_init(pg_sem_t *semaphore, int pshared, int value)
+{
+ pg_atomic_init_u32(&semaphore->value, value);
+ return 0;
+}
+
+/*
+ * Destroy a semaphore.
+ */
+int
+pg_sem_destroy(pg_sem_t *semaphore)
+{
+ return 0;
+}
+
+/*
+ * Increment the semaphore and wake any waiters.
+ *
+ * We use semaphores in contexts where there is always a waiter, so we don't
+ * bother with a waiter count that could suppress useless system calls here.
+ */
+int
+pg_sem_post(pg_sem_t *semaphore)
+{
+ pg_atomic_fetch_add_u32(&semaphore->value, 1);
+ return pg_futex_wake_u32(semaphore, INT_MAX);
+}
+
+/*
+ * Decrement the semaphore, waiting first for it to rise above zero if
+ * necessary.
+ */
+int
+pg_sem_wait(pg_sem_t *semaphore)
+{
+ uint32 value = 1;
+
+ while (!pg_atomic_compare_exchange_u32(&semaphore->value, &value, value - 1))
+ {
+ if (value == 0)
+ {
+ if (pg_futex_wait_u32(semaphore, 0, NULL) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Decrement the semaphore if it is above zero, or fail with EAGAIN if is
+ * already zero.
+ */
+int
+pg_sem_trywait(pg_sem_t *semaphore)
+{
+ uint32 value = pg_atomic_read_u32(&semaphore->value);
+
+ while (value > 0)
+ {
+ if (pg_atomic_compare_exchange_u32(&semaphore->value, &value, value - 1))
+ return 0;
+ }
+
+ errno = EAGAIN;
+ return -1;
+}
+
+#endif /* HAVE_EMULATED_SEMAPHORES */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c4dc5d72bdb..905499a2933 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -95,6 +95,10 @@
don't. */
#undef HAVE_DECL_MEMSET_S
+/* Define to 1 if you have the declaration of `os_sync_wait_on_address', and
+ to 0 if you don't. */
+#undef HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+
/* Define to 1 if you have the declaration of `posix_fadvise', and to 0 if you
don't. */
#undef HAVE_DECL_POSIX_FADVISE
@@ -727,6 +731,9 @@
/* Define to 1 to build with PAM support. (--with-pam) */
#undef USE_PAM
+/* Define to enable semaphore emulation with futexes. */
+#undef USE_SEMAPHORE_EMULATION
+
/* Define to 1 to use software CRC-32C implementation (slicing-by-8). */
#undef USE_SLICING_BY_8_CRC32C
diff --git a/src/include/port/pg_futex.h b/src/include/port/pg_futex.h
new file mode 100644
index 00000000000..6fae7c7e810
--- /dev/null
+++ b/src/include/port/pg_futex.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_futex.h
+ *
+ * Futex abstraction. Initially supported only on macOS, but it could be
+ * extended to other systems. A wrapper struct and initialization function are
+ * used to reserve the option of providing a fallback implementation.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/pg_futex.h
+ *
+ *-------------------------------------------------------------------------/
+ */
+#ifndef PG_FUTEX_H
+#define PG_FUTEX_H
+
+#ifdef FRONTEND
+#error "not for use in frontend code"
+#endif
+
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+#include <os/os_sync_wait_on_address.h>
+#define HAVE_PG_FUTEX_U32
+#endif
+
+#ifdef HAVE_PG_FUTEX_U32
+
+#include "port/atomics.h"
+
+#include <time.h>
+
+typedef struct pg_futex_u32
+{
+ pg_atomic_uint32 value;
+} pg_futex_u32;
+
+static inline void
+pg_futex_init_u32(pg_futex_u32 *futex, uint32 value)
+{
+ pg_atomic_init_u32(&futex->value, value);
+}
+
+/*
+ * Return 0 immediately if futex->value is not equal to the expected value, and
+ * otherwise wait to be woken up explicitly by a thread that has changed the
+ * value, and then return 0. Return -1 and set errno on error. ETIMEDOUT
+ * indicates that the optional timeout has been reached.
+ *
+ * Note for future implementations: not all systems distinguish between value
+ * check failure and being woken up, so this function returns 0 in both cases.
+ */
+static inline int
+pg_futex_wait_u32(pg_futex_u32 *futex, uint32 expected, struct timespec *timeout)
+{
+ /*
+ * Our atomic 32 bit integers are just wrapped integers, so it is safe for
+ * the kernel to read them when it performs its value check.
+ */
+ StaticAssertStmt(sizeof(futex->value) == sizeof(expected), "unexpected size");
+
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+ if (timeout)
+ {
+ uint64 timeout_ns = timeout->tv_sec * 1000000000 + timeout->tv_nsec;
+
+ if (os_sync_wait_on_address_with_timeout(&futex->value,
+ expected,
+ sizeof(futex->value),
+ OS_SYNC_WAIT_ON_ADDRESS_SHARED,
+ OS_CLOCK_MACH_ABSOLUTE_TIME,
+ timeout_ns) >= 0)
+ return 0;
+ }
+ else
+ {
+ if (os_sync_wait_on_address(&futex->value,
+ expected,
+ sizeof(futex->value),
+ OS_SYNC_WAIT_ON_ADDRESS_SHARED) >= 0)
+ return 0;
+ }
+ return -1;
+#else
+#error "futexes not implemented for this platform"
+#endif
+}
+
+/*
+ * Wake at least a given number of waiters.
+ *
+ * Note for future implementations: not all systems report how many were woken,
+ * so this function hides that by returning 0 for success.
+ */
+static inline int
+pg_futex_wake_u32(pg_futex_u32 *futex, int nwaiters)
+{
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+ int rc;
+
+ if (nwaiters == 1)
+ rc = os_sync_wake_by_address_any(&futex->value,
+ sizeof(futex->value),
+ OS_SYNC_WAKE_BY_ADDRESS_SHARED);
+ else
+ rc = os_sync_wake_by_address_all(&futex->value,
+ sizeof(futex->value),
+ OS_SYNC_WAKE_BY_ADDRESS_SHARED);
+ if (rc < 0 && errno == ENOENT)
+ rc = 0; /* no waiters */
+ return rc;
+#else
+#error "futexes not implemented for this platform"
+#endif
+}
+
+#endif /* HAVE_PG_FUTEX_U32 */
+
+#endif /* PG_FUTEX_H */
diff --git a/src/template/darwin b/src/template/darwin
index e8eb9390687..96aa464e424 100644
--- a/src/template/darwin
+++ b/src/template/darwin
@@ -14,17 +14,8 @@ fi
# Extra CFLAGS for code that will go into a shared library
CFLAGS_SL=""
-# Select appropriate semaphore support. Darwin 6.0 (macOS 10.2) and up
-# support System V semaphores; before that we have to use named POSIX
-# semaphores, which are less good for our purposes because they eat a
-# file descriptor per backend per max_connection slot.
-case $host_os in
- darwin[015].*)
- USE_NAMED_POSIX_SEMAPHORES=1
- ;;
- *)
- USE_SYSV_SEMAPHORES=1
- ;;
-esac
+# If targeting macOS 14.4 and above, use futexes to emulate unnamed semaphores,
+# and otherwise fall back to System V semaphores.
+PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
DLSUFFIX=".dylib"
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6f2e93b2d6..e8663470d51 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3871,6 +3871,7 @@ pg_enc2name
pg_encname
pg_fe_sasl_mech
pg_funcptr_t
+pg_futex_u32
pg_gssinfo
pg_hmac_ctx
pg_hmac_errno
@@ -3885,6 +3886,7 @@ pg_regex_t
pg_regmatch_t
pg_regoff_t
pg_saslprep_rc
+pg_sem_t
pg_sha1_ctx
pg_sha224_ctx
pg_sha256_ctx
--
2.39.5 (Apple Git-154)
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
I compared my hand-rolled atomics logic against FreeBSD's user space
semaphores[1]https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c. sem_post() and sem_trywait() are about the same, and I
guess there just aren't many ways to write those. Minor differences:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
2. In sem_trywait() they reload the value while our
pg_atomic_compare_exchange_u32() does that for you, and I also prefer
to avoid assignment expression syntax if I can.
On the other hand my v1 sem_wait() seemed a little janky and weird in
comparison to theirs, and I was a bit worried about bugs. I thought
about rewriting it to look more like that, but then I realised that I
could use sem_trywait() to implement sem_wait() for the same net
effect. It's much shorter and sweeter and more intuitively
understandable, I think?
I also didn't bother to supply a couple of functions that posix_sema.c
doesn't use.
There's no immediately obvious performance difference, but I wasn't
expecting there to be: it's used in the slow path of our LWLocks and
they already have similar user space atomics in front, so you don't
really avoid entering the kernel with this technique except perhaps
just occasionally if you wait just as someone posts. This is all just
about the convenience of bannishing System V cruft. It doesn't seem
to be any slower though, in simple testing, and CI might even be a bit
faster based on spot checks of a few runs... but proper statistical
tools might be needed to see if that is a real phenomenon.
Anyone got any better ideas for how to organise the build scripts,
control the feature, naming etc? I came up with is this, which
automatically falls back to sysv if it can't find what it needs
(respecting Apple's deployment target thing following what we did for
preadv):
meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
The pg_futex.h abstraction layer doesn't have any support for other
OSes yet because I tried to pare this patch down to the absolute
minimum to solve this problem, but I did try at least a little bit to
anticipate that (having removed it from earlier versions from a few
years ago) when designing the futex contracts. We could trivially add
at least OpenBSD support to use emulated semaphores there. I never
quite figured out whether the NetBSD futex API is really available
outside its Linux syscall emulator, but that might be doable too. And
futexes might of course turn out to have applications in other
synchronisation primitives.
Any thoughts on whether this is worth pursuing, and what kinds of
validation might be useful?
[1]: https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
Attachments:
v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patchapplication/octet-stream; name=v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patchDownload
From bf2ceb73fe8a48ece981a82ef103284e5dfc0a54 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH v2] Use futex-based semaphore emulation on macOS.
Provide an implementation of the POSIX unnamed semaphore API when
requested at build time with the following options, enabled by default:
meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
With both build systems, it falls back to sysv if native futexes are not
detected. Only macOS 14.4's new os_sync_wait_on_address() is supported
initially, but this could probably be expanded to OpenBSD and NetBSD.
Futexes might have more general applications and on more platforms in
later work.
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKAPR%3DNXxMTCmEomCUpq%2BmrPLDVyfVa4bKZqfDjwyzD7Q%40mail.gmail.com
---
configure | 24 +++++++
configure.ac | 8 +++
meson.build | 14 ++++
src/backend/port/Makefile | 3 +-
src/backend/port/meson.build | 3 +
src/backend/port/posix_sema.c | 14 ++++
src/backend/port/sem_init.c | 92 ++++++++++++++++++++++++
src/include/pg_config.h.in | 7 ++
src/include/port/pg_futex.h | 120 +++++++++++++++++++++++++++++++
src/template/darwin | 15 +---
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 289 insertions(+), 13 deletions(-)
create mode 100644 src/backend/port/sem_init.c
create mode 100644 src/include/port/pg_futex.h
diff --git a/configure b/configure
index 507a2437c33..aba6f39eb38 100755
--- a/configure
+++ b/configure
@@ -16179,6 +16179,21 @@ _ACEOF
# We can't use AC_CHECK_FUNCS to detect these functions, because it
# won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "os_sync_wait_on_address" "ac_cv_have_decl_os_sync_wait_on_address" "#include <os/os_sync_wait_on_address.h>
+"
+if test "x$ac_cv_have_decl_os_sync_wait_on_address" = xyes; then :
+ ac_have_decl=1
+else
+ ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS $ac_have_decl
+_ACEOF
+if test $ac_have_decl = 1; then :
+ HAVE_NATIVE_FUTEX=1
+fi
+
ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
"
if test "x$ac_cv_have_decl_preadv" = xyes; then :
@@ -18419,6 +18434,15 @@ if test "$ac_res" != no; then :
fi
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"UNNAMED_POSIX+EMULATION" ; then
+ # Need futex support for this
+ if test x"$HAVE_NATIVE_FUTEX" = x"1" ; then
+
+$as_echo "#define USE_SEMAPHORE_EMULATION 1" >>confdefs.h
+
+ USE_UNNAMED_POSIX_SEMAPHORES=1
+ fi
+ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking which semaphore API to use" >&5
$as_echo_n "checking which semaphore API to use... " >&6; }
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
diff --git a/configure.ac b/configure.ac
index 5f4548adc5c..71f7c77b381 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1841,6 +1841,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen, strsep, timingsafe_bcmp])
# We can't use AC_CHECK_FUNCS to detect these functions, because it
# won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([os_sync_wait_on_address], [HAVE_NATIVE_FUTEX=1], [], [#include <os/os_sync_wait_on_address.h>])
AC_CHECK_DECLS([preadv], [], [], [#include <sys/uio.h>])
AC_CHECK_DECLS([pwritev], [], [], [#include <sys/uio.h>])
AC_CHECK_DECLS([strchrnul], [], [], [#include <string.h>])
@@ -2272,6 +2273,13 @@ if test "$PORTNAME" != "win32"; then
# Need sem_init for this
AC_SEARCH_LIBS(sem_init, [rt pthread], [USE_UNNAMED_POSIX_SEMAPHORES=1])
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"UNNAMED_POSIX+EMULATION" ; then
+ # Need futex support for this
+ if test x"$HAVE_NATIVE_FUTEX" = x"1" ; then
+ AC_DEFINE(USE_SEMAPHORE_EMULATION, 1, [Define to enable semaphore emulation with futexes.])
+ USE_UNNAMED_POSIX_SEMAPHORES=1
+ fi
+ fi
AC_MSG_CHECKING([which semaphore API to use])
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
AC_DEFINE(USE_NAMED_POSIX_SEMAPHORES, 1, [Define to select named POSIX semaphores.])
diff --git a/meson.build b/meson.build
index 30e0edda3e7..3c360bb5d02 100644
--- a/meson.build
+++ b/meson.build
@@ -207,6 +207,7 @@ if host_system == 'cygwin'
mod_link_with_dir = 'libdir'
elif host_system == 'darwin'
+ sema_kind = 'unnamed_posix+emulation' # reverts to sysv if unsupported
dlsuffix = '.dylib'
library_path_var = 'DYLD_LIBRARY_PATH'
@@ -2660,6 +2661,7 @@ decl_checks = [
# checking for library symbols wouldn't handle deployment target
# restrictions on macOS
decl_checks += [
+ ['os_sync_wait_on_address', 'os/os_sync_wait_on_address.h'],
['preadv', 'sys/uio.h'],
['pwritev', 'sys/uio.h'],
['strchrnul', 'string.h'],
@@ -2968,6 +2970,18 @@ if sema_kind == 'unnamed_posix' and \
sema_kind = 'sysv'
endif
+# for unnamed_posix+emulation, use unnamed_posix and enable emulation, or fall
+# back to sysv if support isn't found
+if sema_kind == 'unnamed_posix+emulation'
+ if cdata.get('HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS') != 0
+ cdata.set('USE_SEMAPHORE_EMULATION', 1,
+ description: 'Define to 1 to enable emulation of unnamed POSIX semaphores')
+ sema_kind = 'unnamed_posix'
+ else
+ sema_kind = 'sysv'
+ endif
+endif
+
cdata.set('USE_@0@_SHARED_MEMORY'.format(shmem_kind.to_upper()), 1)
cdata.set('USE_@0@_SEMAPHORES'.format(sema_kind.to_upper()), 1)
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index 47338d99229..639049e6be7 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -25,7 +25,8 @@ OBJS = \
$(TAS) \
atomics.o \
pg_sema.o \
- pg_shmem.o
+ pg_shmem.o \
+ sem_init.o
ifeq ($(PORTNAME), win32)
SUBDIRS += win32
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 09d54e01d13..a130549ac9a 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -4,6 +4,9 @@ backend_sources += files(
'atomics.c',
)
+if cdata.has('USE_SEMAPHORE_EMULATION')
+ backend_sources += files('sem_init.c')
+endif
if cdata.has('USE_UNNAMED_POSIX_SEMAPHORES') or cdata.has('USE_NAMED_POSIX_SEMAPHORES')
backend_sources += files('posix_sema.c')
diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..707b0615559 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -36,6 +36,20 @@
#include "storage/pg_sema.h"
#include "storage/shmem.h"
+#ifdef USE_SEMAPHORE_EMULATION
+#include "port/pg_semaphore.h"
+
+/*
+ * Redirect to futex-based emulation of the POSIX unnamed API, for systems that
+ * can't do sem_init() with pshared=1.
+ */
+#define sem_t pg_sem_t
+#define sem_init pg_sem_init
+#define sem_destroy pg_sem_destroy
+#define sem_post pg_sem_post
+#define sem_wait pg_sem_wait
+#define sem_trywait pg_sem_trywait
+#endif
/* see file header comment */
#if defined(USE_NAMED_POSIX_SEMAPHORES) && defined(EXEC_BACKEND)
diff --git a/src/backend/port/sem_init.c b/src/backend/port/sem_init.c
new file mode 100644
index 00000000000..9d0f499240b
--- /dev/null
+++ b/src/backend/port/sem_init.c
@@ -0,0 +1,92 @@
+/*-------------------------------------------------------------------------
+ *
+ * sem_init.c
+ *
+ * Drop-in replacement for POSIX sem_init(), sem_post(), sem_wait() and
+ * sem_destroy(). These can be used on systems that don't provide unnamed
+ * semaphores with pshared=1, but do have shared memory futexes that we can use
+ * to build our own.
+ *
+ * pg_sem_t is a typedef for pg_futex_u32 in src/include/port/pg_semaphore.h.
+ * If we wanted to add a waiter count it could become a struct.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/port/sem_init.c
+ *
+ *-------------------------------------------------------------------------/
+ */
+
+#include "postgres.h"
+
+#ifdef USE_SEMAPHORE_EMULATION
+
+#include "port/atomics.h"
+#include "port/pg_futex.h"
+#include "port/pg_semaphore.h"
+
+/*
+ * Initialize a semaphore with a given value.
+ */
+int
+pg_sem_init(pg_sem_t *semaphore, int pshared, int value)
+{
+ pg_atomic_init_u32(&semaphore->value, value);
+ return 0;
+}
+
+/*
+ * Destroy a semaphore.
+ */
+int
+pg_sem_destroy(pg_sem_t *semaphore)
+{
+ return 0;
+}
+
+/*
+ * Increment the semaphore and wake any waiters.
+ *
+ * We use semaphores in contexts where there is always a waiter, so we don't
+ * bother with a waiter count that could suppress useless system calls here.
+ */
+int
+pg_sem_post(pg_sem_t *semaphore)
+{
+ pg_atomic_fetch_add_u32(&semaphore->value, 1);
+ return pg_futex_wake_u32(semaphore, INT_MAX);
+}
+
+/*
+ * Decrement the semaphore if it is above zero, or fail with EAGAIN if is
+ * already zero.
+ */
+int
+pg_sem_trywait(pg_sem_t *semaphore)
+{
+ uint32 value = pg_atomic_read_u32(&semaphore->value);
+
+ while (value > 0)
+ if (pg_atomic_compare_exchange_u32(&semaphore->value, &value, value - 1))
+ return 0;
+
+ errno = EAGAIN;
+ return -1;
+}
+
+/*
+ * Decrement the semaphore, waiting first for it to rise above zero if
+ * necessary.
+ */
+int
+pg_sem_wait(pg_sem_t *semaphore)
+{
+ while (pg_sem_trywait(semaphore) < 0)
+ if (pg_futex_wait_u32(semaphore, 0, NULL) < 0)
+ return -1;
+
+ return 0;
+}
+
+#endif /* HAVE_EMULATED_SEMAPHORES */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c4dc5d72bdb..905499a2933 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -95,6 +95,10 @@
don't. */
#undef HAVE_DECL_MEMSET_S
+/* Define to 1 if you have the declaration of `os_sync_wait_on_address', and
+ to 0 if you don't. */
+#undef HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+
/* Define to 1 if you have the declaration of `posix_fadvise', and to 0 if you
don't. */
#undef HAVE_DECL_POSIX_FADVISE
@@ -727,6 +731,9 @@
/* Define to 1 to build with PAM support. (--with-pam) */
#undef USE_PAM
+/* Define to enable semaphore emulation with futexes. */
+#undef USE_SEMAPHORE_EMULATION
+
/* Define to 1 to use software CRC-32C implementation (slicing-by-8). */
#undef USE_SLICING_BY_8_CRC32C
diff --git a/src/include/port/pg_futex.h b/src/include/port/pg_futex.h
new file mode 100644
index 00000000000..6fae7c7e810
--- /dev/null
+++ b/src/include/port/pg_futex.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_futex.h
+ *
+ * Futex abstraction. Initially supported only on macOS, but it could be
+ * extended to other systems. A wrapper struct and initialization function are
+ * used to reserve the option of providing a fallback implementation.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/pg_futex.h
+ *
+ *-------------------------------------------------------------------------/
+ */
+#ifndef PG_FUTEX_H
+#define PG_FUTEX_H
+
+#ifdef FRONTEND
+#error "not for use in frontend code"
+#endif
+
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+#include <os/os_sync_wait_on_address.h>
+#define HAVE_PG_FUTEX_U32
+#endif
+
+#ifdef HAVE_PG_FUTEX_U32
+
+#include "port/atomics.h"
+
+#include <time.h>
+
+typedef struct pg_futex_u32
+{
+ pg_atomic_uint32 value;
+} pg_futex_u32;
+
+static inline void
+pg_futex_init_u32(pg_futex_u32 *futex, uint32 value)
+{
+ pg_atomic_init_u32(&futex->value, value);
+}
+
+/*
+ * Return 0 immediately if futex->value is not equal to the expected value, and
+ * otherwise wait to be woken up explicitly by a thread that has changed the
+ * value, and then return 0. Return -1 and set errno on error. ETIMEDOUT
+ * indicates that the optional timeout has been reached.
+ *
+ * Note for future implementations: not all systems distinguish between value
+ * check failure and being woken up, so this function returns 0 in both cases.
+ */
+static inline int
+pg_futex_wait_u32(pg_futex_u32 *futex, uint32 expected, struct timespec *timeout)
+{
+ /*
+ * Our atomic 32 bit integers are just wrapped integers, so it is safe for
+ * the kernel to read them when it performs its value check.
+ */
+ StaticAssertStmt(sizeof(futex->value) == sizeof(expected), "unexpected size");
+
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+ if (timeout)
+ {
+ uint64 timeout_ns = timeout->tv_sec * 1000000000 + timeout->tv_nsec;
+
+ if (os_sync_wait_on_address_with_timeout(&futex->value,
+ expected,
+ sizeof(futex->value),
+ OS_SYNC_WAIT_ON_ADDRESS_SHARED,
+ OS_CLOCK_MACH_ABSOLUTE_TIME,
+ timeout_ns) >= 0)
+ return 0;
+ }
+ else
+ {
+ if (os_sync_wait_on_address(&futex->value,
+ expected,
+ sizeof(futex->value),
+ OS_SYNC_WAIT_ON_ADDRESS_SHARED) >= 0)
+ return 0;
+ }
+ return -1;
+#else
+#error "futexes not implemented for this platform"
+#endif
+}
+
+/*
+ * Wake at least a given number of waiters.
+ *
+ * Note for future implementations: not all systems report how many were woken,
+ * so this function hides that by returning 0 for success.
+ */
+static inline int
+pg_futex_wake_u32(pg_futex_u32 *futex, int nwaiters)
+{
+#if HAVE_DECL_OS_SYNC_WAIT_ON_ADDRESS
+ int rc;
+
+ if (nwaiters == 1)
+ rc = os_sync_wake_by_address_any(&futex->value,
+ sizeof(futex->value),
+ OS_SYNC_WAKE_BY_ADDRESS_SHARED);
+ else
+ rc = os_sync_wake_by_address_all(&futex->value,
+ sizeof(futex->value),
+ OS_SYNC_WAKE_BY_ADDRESS_SHARED);
+ if (rc < 0 && errno == ENOENT)
+ rc = 0; /* no waiters */
+ return rc;
+#else
+#error "futexes not implemented for this platform"
+#endif
+}
+
+#endif /* HAVE_PG_FUTEX_U32 */
+
+#endif /* PG_FUTEX_H */
diff --git a/src/template/darwin b/src/template/darwin
index e8eb9390687..96aa464e424 100644
--- a/src/template/darwin
+++ b/src/template/darwin
@@ -14,17 +14,8 @@ fi
# Extra CFLAGS for code that will go into a shared library
CFLAGS_SL=""
-# Select appropriate semaphore support. Darwin 6.0 (macOS 10.2) and up
-# support System V semaphores; before that we have to use named POSIX
-# semaphores, which are less good for our purposes because they eat a
-# file descriptor per backend per max_connection slot.
-case $host_os in
- darwin[015].*)
- USE_NAMED_POSIX_SEMAPHORES=1
- ;;
- *)
- USE_SYSV_SEMAPHORES=1
- ;;
-esac
+# If targeting macOS 14.4 and above, use futexes to emulate unnamed semaphores,
+# and otherwise fall back to System V semaphores.
+PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
DLSUFFIX=".dylib"
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6f2e93b2d6..e8663470d51 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3871,6 +3871,7 @@ pg_enc2name
pg_encname
pg_fe_sasl_mech
pg_funcptr_t
+pg_futex_u32
pg_gssinfo
pg_hmac_ctx
pg_hmac_errno
@@ -3885,6 +3886,7 @@ pg_regex_t
pg_regmatch_t
pg_regoff_t
pg_saslprep_rc
+pg_sem_t
pg_sha1_ctx
pg_sha224_ctx
pg_sha256_ctx
--
2.39.5 (Apple Git-154)
On Aug 12, 2025, at 4:07 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux. So that's pretty inconclusive about
what the aboriginal behavior was. I also found that OpenIndiana
behaves like Linux.Out of curiosity, FreeBSD changed that in 2016:
https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
I believe that this is the code in macOS's "XNU" [1]https://github.com/apple-oss-distributions/xnu/blob/e3723e1f17661b24996789d8afc084c0c3303b26/bsd/kern/sysv_sem.c#L860-L889[2]"XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and a C++ API for writing drivers called IOKit." which has similar
logic (unsurprisingly) to the FreeBSD code before the identified commit [3]https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e.
best.
-greg
[1]: https://github.com/apple-oss-distributions/xnu/blob/e3723e1f17661b24996789d8afc084c0c3303b26/bsd/kern/sysv_sem.c#L860-L889
[2]: "XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and a C++ API for writing drivers called IOKit."
[3]: https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
On Aug 13, 2025, at 8:01 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a new attempt at that. It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.
Thomas, thanks for digging into this platform specific issue. I've read
your patch and done some research. Thoughts below...
I compared my hand-rolled atomics logic against FreeBSD's user space
semaphores[1]. sem_post() and sem_trywait() are about the same, and I
guess there just aren't many ways to write those. Minor differences:1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1]https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
2. In sem_trywait() they reload the value while our
pg_atomic_compare_exchange_u32() does that for you, and I also prefer
to avoid assignment expression syntax if I can.On the other hand my v1 sem_wait() seemed a little janky and weird in
comparison to theirs, and I was a bit worried about bugs. I thought
about rewriting it to look more like that, but then I realised that I
could use sem_trywait() to implement sem_wait() for the same net
effect. It's much shorter and sweeter and more intuitively
understandable, I think?
So if I understand it the way your sem_trywait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try again
This is essentially how sem_trywait() implementations work in Linux IIRC,
they use futexes or similar kernel primitives for efficient blocking.
I also didn't bother to supply a couple of functions that posix_sema.c
doesn't use.
I think that's fine for now.
There's no immediately obvious performance difference, but I wasn't
expecting there to be: it's used in the slow path of our LWLocks and
they already have similar user space atomics in front, so you don't
really avoid entering the kernel with this technique except perhaps
just occasionally if you wait just as someone posts. This is all just
about the convenience of bannishing System V cruft. It doesn't seem
to be any slower though, in simple testing, and CI might even be a bit
faster based on spot checks of a few runs... but proper statistical
tools might be needed to see if that is a real phenomenon.
That lines up with my intuition after reading the patch. I don't know
that it is even worth diving too deep in performance testing this.
Anyone got any better ideas for how to organise the build scripts,
control the feature, naming etc? I came up with is this, which
automatically falls back to sysv if it can't find what it needs
(respecting Apple's deployment target thing following what we did for
preadv):meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
I'm not an expert on Meson, but I do have years of Autoconf work under
my belt. Offhand this seems reasonable, but I'm new to this project.
The pg_futex.h abstraction layer doesn't have any support for other
OSes yet because I tried to pare this patch down to the absolute
minimum to solve this problem, but I did try at least a little bit to
anticipate that (having removed it from earlier versions from a few
years ago) when designing the futex contracts. We could trivially add
at least OpenBSD support to use emulated semaphores there. I never
quite figured out whether the NetBSD futex API is really available
outside its Linux syscall emulator, but that might be doable too. And
futexes might of course turn out to have applications in other
synchronisation primitives.Any thoughts on whether this is worth pursuing, and what kinds of
validation might be useful?[1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
<v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch>
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.
best.
-greg
[1]: https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
looking at one of the earlier versions. But I see now that POSIX 2024
has just added it! So the question is: do we care?
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess). I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits. So
perhaps it's not worth more than a comment?
I've found only very brief discussions in the standards community so far:
https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow
So if I understand it the way your sem_trywait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try againThis is essentially how sem_trywait() implementations work in Linux IIRC,
they use futexes or similar kernel primitives for efficient blocking.
Yeah. Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.
It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source. It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32) and maybe even do away with the struct and just use
pg_atomic_uint32 directly then you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses. But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
We'd be checking if a uint32 is > UINT32_MAX, assuming again we that
we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
you tried it.
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.
Thanks for looking! And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().
On Sat, Aug 16, 2025 at 2:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess).
And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c. I suppose it's not
expected to go very high at all? I was just trying to think about
where the bounds on that come from in theory, while working through
the case for ignoring EOVERFLOW...
Thomas Munro <thomas.munro@gmail.com> writes:
And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c. I suppose it's not
expected to go very high at all?
I wouldn't expect so.
regards, tom lane
On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add. Is that bad? I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32 bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
looking at one of the earlier versions. But I see now that POSIX 2024
has just added it! So the question is: do we care?
I’m on the fence about "caring", if your code fixes the issue we could
simply add a comment about eventually needing to support EOVERFLOW. I
don't think what you have in the patch is wrong and we're solving our
issue, not the worlds inconsistencies.
Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess). I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits. So
perhaps it's not worth more than a comment?I've found only very brief discussions in the standards community so far:
https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow
I see what you mean, a comment that captures this decision and we're
done IMO.
So if I understand it the way your sem_trywait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try againThis is essentially how sem_trywait() implementations work in Linux IIRC,
they use futexes or similar kernel primitives for efficient blocking.Yeah. Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.
Indeed, the "fast userspace mutex". *sigh* ;-P
I was calling this out because an implementation of sem_trywait() that
used sem_wait() that wasn't based on futex would be less correct AFAICT.
It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source. It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32) and maybe even do away with the struct and just use
pg_atomic_uint32 directly then you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses. But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
We'd be checking if a uint32 is > UINT32_MAX, assuming again we that
we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
you tried it.
Oops, yep. You are right. :)
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.Thanks for looking! And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().
Sure thing, anytime. Not sure I helped much...
-greg