[PATCH] Make ENOSPC not fatal in semaphore creation
From: Mikhail <mp39590@gmail.com>
We might be in situation when we have "just enough" semaphores in the
system limit to start but previously crashed unexpectedly, in that case
we won't be able to start again - semget() will return ENOSPC, despite
the semaphores are ours, and we can recycle them, so check this
situation and try to remove the semaphore, if we are unable - give up
and abort.
---
src/backend/port/sysv_sema.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 21c883ba9a..a889591dba 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -88,10 +88,6 @@ static void ReleaseSemaphores(int status, Datum arg);
*
* Attempt to create a new semaphore set with the specified key.
* Will fail (return -1) if such a set already exists.
- *
- * If we fail with a failure code other than collision-with-existing-set,
- * print out an error and abort. Other types of errors suggest nonrecoverable
- * problems.
*/
static IpcSemaphoreId
InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
@@ -118,10 +114,33 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
return -1;
/*
- * Else complain and abort
+ * We might be in situation when we have "just enough" semaphores in the system
+ * limit to start but previously crashed unexpectedly, in that case we won't be
+ * able to start again - semget() will return ENOSPC, despite the semaphores
+ * are ours, and we can recycle them, so check this situation and try to remove
+ * the semaphore, if we are unable - give up and abort.
+ *
+ * We use same semkey for every start - it's gotten from inode number of the
+ * data folder. So on repeated starts we will use the same key.
*/
+ if (saved_errno == ENOSPC)
+ {
+ union semun semun;
+
+ semId = semget(semKey, 0, 0);
+
+ semun.val = 0; /* unused, but keep compiler quiet */
+ if (semctl(semId, 0, IPC_RMID, semun) == 0)
+ {
+ /* Recycled - get the same semaphore again */
+ semId = semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection);
+
+ return semId;
+ }
+ }
+
ereport(FATAL,
- (errmsg("could not create semaphores: %m"),
+ (errmsg("could not create semaphores: %s", strerror(saved_errno)),
errdetail("Failed system call was semget(%lu, %d, 0%o).",
(unsigned long) semKey, numSems,
IPC_CREAT | IPC_EXCL | IPCProtection),
--
2.33.0
mp39590@gmail.com writes:
We might be in situation when we have "just enough" semaphores in the
system limit to start but previously crashed unexpectedly, in that case
we won't be able to start again - semget() will return ENOSPC, despite
the semaphores are ours, and we can recycle them, so check this
situation and try to remove the semaphore, if we are unable - give up
and abort.
AFAICS, this patch could be disastrous. What if the semaphore in
question belongs to some other postmaster?
Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.
regards, tom lane
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
mp39590@gmail.com writes:
We might be in situation when we have "just enough" semaphores in the
system limit to start but previously crashed unexpectedly, in that case
we won't be able to start again - semget() will return ENOSPC, despite
the semaphores are ours, and we can recycle them, so check this
situation and try to remove the semaphore, if we are unable - give up
and abort.AFAICS, this patch could be disastrous. What if the semaphore in
question belongs to some other postmaster?
Does running more than one postmaster on the same PGDATA is supported at
all? Currently seed for the semaphore key is inode number of PGDATA.
Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.
The logic of creating semas:
218 /* Loop till we find a free IPC key */
219 for (nextSemaKey++;; nextSemaKey++)
220 {
221 pid_t creatorPID;
222
223 /* Try to create new semaphore set */
224 semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
225 if (semId >= 0)
226 break; /* successful create */
InternalIpcSemaphoreCreate:
101 semId = semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection);
102
103 if (semId < 0)
104 {
105 int saved_errno = errno;
106
[...]
113 if (saved_errno == EEXIST || saved_errno == EACCES
114 #ifdef EIDRM
115 || saved_errno == EIDRM
116 #endif
117 )
118 return -1;
119
120 /*
121 * Else complain and abort
122 */
123 ereport(FATAL, [...]
semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1
so the whole logic of IpcSemaphoreCreate is not checked.
Mikhail <mp39590@gmail.com> writes:
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
AFAICS, this patch could be disastrous. What if the semaphore in
question belongs to some other postmaster?
Does running more than one postmaster on the same PGDATA is supported at
all? Currently seed for the semaphore key is inode number of PGDATA.
That hardly guarantees no collisions. If it did, we'd never have bothered
with the PGSemaMagic business or the IpcSemaphoreGetLastPID check.
Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.
semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1
so the whole logic of IpcSemaphoreCreate is not checked.
Hmm. Maybe you could improve this by removing the first
InternalIpcSemaphoreCreate call in IpcSemaphoreCreate, and
rearranging the logic so that the first step consists of seeing
whether a sema set is already there (and can safely be zapped),
and only then proceed with creation.
I am, however, concerned that this'll just trade off one hazard for
another. Instead of a risk of failing with ENOSPC (which the DBA
can fix), we'll have a risk of kneecapping some other process at
random (which the DBA can do nothing to prevent).
I'm also fairly unclear on when the logic you propose would trigger
at all. If the sema set is already there, I'd expect EEXIST or
equivalent, not ENOSPC.
regards, tom lane
On Sun, Oct 17, 2021 at 10:52:38AM -0400, Tom Lane wrote:
Mikhail <mp39590@gmail.com> writes:
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
AFAICS, this patch could be disastrous. What if the semaphore in
question belongs to some other postmaster?Does running more than one postmaster on the same PGDATA is supported at
all? Currently seed for the semaphore key is inode number of PGDATA.That hardly guarantees no collisions. If it did, we'd never have bothered
with the PGSemaMagic business or the IpcSemaphoreGetLastPID check.
Got it, makes sense. Also, I was presented with examples that inode
number can be reused across mounting points for different clusters.
Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1
so the whole logic of IpcSemaphoreCreate is not checked.Hmm. Maybe you could improve this by removing the first
InternalIpcSemaphoreCreate call in IpcSemaphoreCreate, and
rearranging the logic so that the first step consists of seeing
whether a sema set is already there (and can safely be zapped),
and only then proceed with creation.
I think, I can look into this on the next weekend. On first glance the
solution works for me.
I am, however, concerned that this'll just trade off one hazard for
another. Instead of a risk of failing with ENOSPC (which the DBA
can fix), we'll have a risk of kneecapping some other process at
random (which the DBA can do nothing to prevent).
Good argument, but I'll try to make second version of the patch with the
proposed logic change to see what we will get. I think it's "right"
behavior to recycle our own used semaphores, so the whole approach is
correct.
I'm also fairly unclear on when the logic you propose would trigger
at all. If the sema set is already there, I'd expect EEXIST or
equivalent, not ENOSPC.
The logic works - the initial call to semget() in
InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
tested the patch on OpenBSD 7.0, it successfully recycles sem's after
previous "pkill -6 postgres". Verified it with 'ipcs -s'.
On Mon, Oct 18, 2021 at 4:49 AM Mikhail <mp39590@gmail.com> wrote:
The logic works - the initial call to semget() in
InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
tested the patch on OpenBSD 7.0, it successfully recycles sem's after
previous "pkill -6 postgres". Verified it with 'ipcs -s'.
Since you mentioned OpenBSD, what do you think of the idea of making
named POSIX semas the default on that platform? You can't run out of
those practically speaking, but then you get lots of little memory
mappings (from memory, at least it does close the fd for each one,
unlike some other OSes where we wouldn't want to use this technique).
Trivial patch:
/messages/by-id/CA+hUKGJVSjiDjbJpHwUrvA1TikFnJRfyJanrHofAWhnqcDJayQ@mail.gmail.com
No strong opinion on the tradeoffs here, as I'm not an OpenBSD user,
but it's something I think about whenever testing portability stuff
there and having to adjust the relevant sysctls.
Note: The best kind would be *unnamed* POSIX semas, where we get to
control their placement in existing memory; that's what we do on Linux
and FreeBSD. They weren't supported on OpenBSD last time we checked:
it rejects requests for shared ones. I wonder if someone could
implement them with just a few lines of user space code, using atomic
counters and futex() for waiting.
On Mon, Oct 18, 2021 at 10:07:40AM +1300, Thomas Munro wrote:
On Mon, Oct 18, 2021 at 4:49 AM Mikhail <mp39590@gmail.com> wrote:
The logic works - the initial call to semget() in
InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
tested the patch on OpenBSD 7.0, it successfully recycles sem's after
previous "pkill -6 postgres". Verified it with 'ipcs -s'.Since you mentioned OpenBSD, what do you think of the idea of making
named POSIX semas the default on that platform? You can't run out of
those practically speaking, but then you get lots of little memory
mappings (from memory, at least it does close the fd for each one,
unlike some other OSes where we wouldn't want to use this technique).
Trivial patch:/messages/by-id/CA+hUKGJVSjiDjbJpHwUrvA1TikFnJRfyJanrHofAWhnqcDJayQ@mail.gmail.com
No strong opinion on the tradeoffs here, as I'm not an OpenBSD user,
but it's something I think about whenever testing portability stuff
there and having to adjust the relevant sysctls.Note: The best kind would be *unnamed* POSIX semas, where we get to
control their placement in existing memory; that's what we do on Linux
and FreeBSD. They weren't supported on OpenBSD last time we checked:
it rejects requests for shared ones. I wonder if someone could
implement them with just a few lines of user space code, using atomic
counters and futex() for waiting.
Hello, sorry for not replying earlier - I was able to think about and
test the patch only on the weekend.
I totally agree with your approach, in conversation with one of the
OpenBSD developers he supported using of sem_open(), because most ports
use it and consistency is desirable across our ports tree. It looks like
PostgreSQL was the only port to use semget().
Switching to sem_open() also looks much safer than patching sysv_sema.c
for corner ENOSPC case as Tom already mentioned.
In your patch I've removed testing for 5.x versions, because official
releases are supported only for one year, no need to worry about them.
The patch is tested with 'make installcheck', also I can confirm that
'ipcs' shows that no semaphores are used, and server starts normally
after 'pkill -6 postgres' with the default semmns sysctl, what was the
original motivation for the work.
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index d74d1ed7af..2dfea0662b 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -998,21 +998,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
<para>
The default shared memory settings are usually good enough, unless
you have set <literal>shared_memory_type</literal> to <literal>sysv</literal>.
- You will usually want to
- increase <literal>kern.seminfo.semmni</literal>
- and <literal>kern.seminfo.semmns</literal>,
- as <systemitem class="osname">OpenBSD</systemitem>'s default settings
- for these are uncomfortably small.
- </para>
-
- <para>
- IPC parameters can be adjusted using <command>sysctl</command>,
- for example:
-<screen>
-<prompt>#</prompt> <userinput>sysctl kern.seminfo.semmni=100</userinput>
-</screen>
- To make these settings persist over reboots, modify
- <filename>/etc/sysctl.conf</filename>.
+ System V semaphores are not used on this platform.
</para>
</listitem>
diff --git a/src/template/openbsd b/src/template/openbsd
index 365268c489..41221af382 100644
--- a/src/template/openbsd
+++ b/src/template/openbsd
@@ -2,3 +2,7 @@
# Extra CFLAGS for code that will go into a shared library
CFLAGS_SL="-fPIC -DPIC"
+
+# OpenBSD 5.5 (2014) gained named POSIX semaphores. They work out of the box
+# without changing any sysctl settings, unlike System V semaphores.
+USE_NAMED_POSIX_SEMAPHORES=1
Mikhail <mp39590@gmail.com> writes:
In your patch I've removed testing for 5.x versions, because official
releases are supported only for one year, no need to worry about them.
Official support or no, we have OpenBSD 5.9 in our buildfarm, so
ignoring the case isn't going to fly.
regards, tom lane
On Sat, Oct 23, 2021 at 8:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mikhail <mp39590@gmail.com> writes:
In your patch I've removed testing for 5.x versions, because official
releases are supported only for one year, no need to worry about them.Official support or no, we have OpenBSD 5.9 in our buildfarm, so
ignoring the case isn't going to fly.
It was a test for < 5.5, so that aspect's OK.
On Fri, Oct 22, 2021 at 03:43:00PM -0400, Tom Lane wrote:
Mikhail <mp39590@gmail.com> writes:
In your patch I've removed testing for 5.x versions, because official
releases are supported only for one year, no need to worry about them.Official support or no, we have OpenBSD 5.9 in our buildfarm, so
ignoring the case isn't going to fly.
5.9 has support for unnamed POSIX semas. Do you think new machine with
OpenBSD <5.5 (when unnamed POSIX semas were introduced) can appear in
buildfarm or be used by real customer?
I have no objections on testing "openbsd5.[01234]" and using SysV semas
there and can redo and test the patch, but isn't it over caution?
Mikhail <mp39590@gmail.com> writes:
On Fri, Oct 22, 2021 at 03:43:00PM -0400, Tom Lane wrote:
Official support or no, we have OpenBSD 5.9 in our buildfarm, so
ignoring the case isn't going to fly.
5.9 has support for unnamed POSIX semas. Do you think new machine with
OpenBSD <5.5 (when unnamed POSIX semas were introduced) can appear in
buildfarm or be used by real customer?
Nah, I misunderstood you to say that 5.9 would also be affected.
regards, tom lane
Mikhail <mp39590@gmail.com> writes:
+# OpenBSD 5.5 (2014) gained named POSIX semaphores. They work out of the box +# without changing any sysctl settings, unlike System V semaphores. +USE_NAMED_POSIX_SEMAPHORES=1
I tried this on an OpenBSD 6.0 image I had handy. The good news is
that it works, and I can successfully start the postmaster with a lot
of semaphores (I tried with max_connections=10000) without any special
system configuration. The bad news is it's *slow*. It takes the
postmaster over a minute to start up at 10000 max_connections, and
also about 15 seconds to shut down. The regression tests also appear
noticeably slower, even at the default max_connections=100. I'm
afraid that those "lots of tiny mappings" that Thomas noted have
a nasty impact on our process launch times, since the kernel
presumably has to do work to clone them into the child process.
Now this lashup that I'm testing on is by no means well suited for
performance tests, so maybe my numbers are bogus. Also, maybe it's
better in more recent OpenBSD releases. But I think we need to take a
harder look at performance before we decide that it's okay to change
the default semaphore type for this platform.
regards, tom lane
On Fri, Oct 22, 2021 at 09:00:31PM -0400, Tom Lane wrote:
I tried this on an OpenBSD 6.0 image I had handy. The good news is
that it works, and I can successfully start the postmaster with a lot
of semaphores (I tried with max_connections=10000) without any special
system configuration. The bad news is it's *slow*. It takes the
postmaster over a minute to start up at 10000 max_connections, and
also about 15 seconds to shut down. The regression tests also appear
noticeably slower, even at the default max_connections=100. I'm
afraid that those "lots of tiny mappings" that Thomas noted have
a nasty impact on our process launch times, since the kernel
presumably has to do work to clone them into the child process.Now this lashup that I'm testing on is by no means well suited for
performance tests, so maybe my numbers are bogus. Also, maybe it's
better in more recent OpenBSD releases. But I think we need to take a
harder look at performance before we decide that it's okay to change
the default semaphore type for this platform.
I got following results for "time make installcheck" on a laptop with
OpenBSD 7.0 (amd64):
POSIX (max_connections=100) (default): 1m32.39s real 0m03.82s user 0m05.75s system
POSIX (max_connections=10000): 2m13.11s real 0m03.56s user 0m07.06s system
SysV (max_connections=100) (default): 1m24.39s real 0m03.30s user 0m04.94s system
SysV (max_connections=10000): failed to start
after sysctl tunning:
SysV (max_connections=10000): 1m47.51s real 0m03.78s user 0m05.61s system
I can confirm that start and stop of the server was slower in POSIX
case, but not terribly different (seconds, not a minute, as in your
case).
As the OpenBSD developers said - those who use OpenBSD are never after a
good performance, the system has a lot of bottlenecks except IPCs.
I see following reasons to switch from SysV to POSIX:
- consistency in the ports tree, all major ports use POSIX, it means
better testing of the API
- as already pointed out - OpenBSD isn't about performance, and the
results for default max_connections are pretty close
- crash recovery with the OS defaults is automatic and don't require DBA
intervention and knowledge of ipcs and ipcrm
- higher density is available without system tuning
The disadvantage is in a worse performance for extreme cases, but I'm
not sure OpenBSD is used for them in production.
On Sun, Oct 17, 2021 at 10:52:38AM -0400, Tom Lane wrote:
I am, however, concerned that this'll just trade off one hazard for
another. Instead of a risk of failing with ENOSPC (which the DBA
can fix), we'll have a risk of kneecapping some other process at
random (which the DBA can do nothing to prevent).
I tend to agree, and along with semas patch would like to suggest error
message improvement, it would have saved me about half a day of digging.
Tested on OpenBSD 7.0.
I'm not a native speaker though, so grammar need to be checked.
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 21c883ba9a..b84f70b5e2 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -133,7 +133,10 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
"respective kernel parameter. Alternatively, reduce PostgreSQL's "
"consumption of semaphores by reducing its max_connections parameter.\n"
"The PostgreSQL documentation contains more information about "
- "configuring your system for PostgreSQL.") : 0));
+ "configuring your system for PostgreSQL.\n"
+ "If server has crashed previously there may be resources left "
+ "after it - take a look at ipcs(1) and ipcrm(1) man pages to see "
+ "how to remove them.") : 0));
}
return semId;
On Mon, Oct 18, 2021 at 10:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Note: The best kind would be *unnamed* POSIX semas, where we get to
control their placement in existing memory; that's what we do on Linux
and FreeBSD. They weren't supported on OpenBSD last time we checked:
it rejects requests for shared ones. I wonder if someone could
implement them with just a few lines of user space code, using atomic
counters and futex() for waiting.
I meant that it'd be cool if OpenBSD implemented shared memory unnamed
semas that way (as other OSes do), but just for fun I tried
implementing that in PostgreSQL. I already had a patch to provide a
wrapper API for futexes on a bunch of OSes including OpenBSD (because
I've been looking into ways to rewrite lwlock.c to use futexes
directly and skip all the per-backend semaphore stuff). That made it
easy to write a quick-and-dirty clone of sem_{init,wait,post}() using
atomics and futexes.
Sadly, although the attached proof-of-concept patch allows a
PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
lacks native unnamed semas), FreeBSD and Linux (which don't need this
but are interesting to test), and it also works on OpenBSD with
shared_memory_type=sysv, it doesn't work on OpenBSD with
shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
has a bug: inherited anonymous shared mmap memory seems to confuse it
so that wakeups are lost. Arrrgh!
Attachments:
0001-A-basic-API-for-futexes.patchtext/x-patch; charset=US-ASCII; name=0001-A-basic-API-for-futexes.patchDownload
From 38bb61aeeefc24cdc772474b0fb4f1802d3b7577 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH 1/2] A basic API for futexes.
A thin wrapper for basic 32 bit futex wait and wake. Currently, it maps
to native support on Linux, FreeBSD, OpenBSD and macOS, with detection
via configure. More operating systems and more operations are possible.
---
configure | 4 +-
configure.ac | 4 ++
src/include/pg_config.h.in | 12 ++++
src/include/port/pg_futex.h | 139 ++++++++++++++++++++++++++++++++++++
4 files changed, 157 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_futex.h
diff --git a/configure b/configure
index 4ffefe4655..23f1cbe9d0 100755
--- a/configure
+++ b/configure
@@ -13381,7 +13381,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
fi
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h linux/futex.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/futex.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/umtx.h sys/un.h termios.h ucred.h wctype.h
do :
as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
@@ -15492,7 +15492,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
+for ac_func in __ulock_wait backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 44ee3ebe2f..542b46437e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1363,10 +1363,12 @@ AC_CHECK_HEADERS(m4_normalize([
getopt.h
ifaddrs.h
langinfo.h
+ linux/futex.h
mbarrier.h
poll.h
sys/epoll.h
sys/event.h
+ sys/futex.h
sys/ipc.h
sys/prctl.h
sys/procctl.h
@@ -1378,6 +1380,7 @@ AC_CHECK_HEADERS(m4_normalize([
sys/sockio.h
sys/tas.h
sys/uio.h
+ sys/umtx.h
sys/un.h
termios.h
ucred.h
@@ -1698,6 +1701,7 @@ LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
AC_CHECK_FUNCS(m4_normalize([
+ __ulock_wait
backtrace_symbols
clock_gettime
copyfile
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 15ffdd895a..6bd2f7b5d8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -367,6 +367,9 @@
/* Define to 1 if you have the `link' function. */
#undef HAVE_LINK
+/* Define to 1 if you have the <linux/futex.h> header file. */
+#undef HAVE_LINUX_FUTEX_H
+
/* Define to 1 if the system has the type `locale_t'. */
#undef HAVE_LOCALE_T
@@ -623,6 +626,9 @@
/* Define to 1 if you have the <sys/event.h> header file. */
#undef HAVE_SYS_EVENT_H
+/* Define to 1 if you have the <sys/futex.h> header file. */
+#undef HAVE_SYS_FUTEX_H
+
/* Define to 1 if you have the <sys/ipc.h> header file. */
#undef HAVE_SYS_IPC_H
@@ -665,6 +671,9 @@
/* Define to 1 if you have the <sys/uio.h> header file. */
#undef HAVE_SYS_UIO_H
+/* Define to 1 if you have the <sys/umtx.h> header file. */
+#undef HAVE_SYS_UMTX_H
+
/* Define to 1 if you have the <sys/un.h> header file. */
#undef HAVE_SYS_UN_H
@@ -779,6 +788,9 @@
/* Define to 1 if you have the `__strtoull' function. */
#undef HAVE___STRTOULL
+/* Define to 1 if you have the `__ulock_wait' function. */
+#undef HAVE___ULOCK_WAIT
+
/* Define to the appropriate printf length modifier for 64-bit ints. */
#undef INT64_MODIFIER
diff --git a/src/include/port/pg_futex.h b/src/include/port/pg_futex.h
new file mode 100644
index 0000000000..bf254a18b0
--- /dev/null
+++ b/src/include/port/pg_futex.h
@@ -0,0 +1,139 @@
+/*
+ * Minimal wrapper over futex APIs.
+ */
+
+#ifndef PG_FUTEX_H
+#define PG_FUTEX_H
+
+#if defined(HAVE_LINUX_FUTEX_H)
+
+/* https://man7.org/linux/man-pages/man2/futex.2.html */
+
+#include <linux/futex.h>
+#include <sys/syscall.h>
+
+#elif defined(HAVE_SYS_FUTEX_H)
+
+/* https://man.openbsd.org/futex, since OpenBSD 6.2. */
+
+#include <sys/time.h>
+#include <sys/futex.h>
+
+#elif defined(HAVE_SYS_UMTX_H)
+
+/* https://www.freebsd.org/cgi/man.cgi?query=_umtx_op */
+
+#include <sys/types.h>
+#include <sys/umtx.h>
+
+#elif defined(HAVE___ULOCK_WAIT)
+
+/*
+ * This interface is undocumented, but provided by libSystem.dylib since
+ * xnu-3789.1.32 (macOS 10.12, 2016) and is used by eg libc++.
+ *
+ * https://github.com/apple/darwin-xnu/blob/main/bsd/kern/sys_ulock.c
+ * https://github.com/apple/darwin-xnu/blob/main/bsd/sys/ulock.h
+ */
+
+#include <stdint.h>
+
+#define UL_COMPARE_AND_WAIT_SHARED 3
+#define ULF_WAKE_ALL 0x00000100
+extern int __ulock_wait(uint32_t operation,
+ void *addr,
+ uint64_t value,
+ uint32_t timeout);
+extern int __ulock_wake(uint32_t operation,
+ void *addr,
+ uint64_t wake_value);
+
+#endif
+
+/*
+ * Wait for someone to call pg_futex_wake() for the same address, with an
+ * initial check that the value pointed to by 'fut' matches 'value' and an
+ * optional timeout. Returns 0 when woken, and otherwise -1, with errno set to
+ * EAGAIN if the initial value check fails, and otherwise errors including
+ * EINTR, ETIMEDOUT and EFAULT.
+ */
+static int
+pg_futex_wait_u32(volatile void *fut,
+ uint32 value,
+ struct timespec *timeout)
+{
+#if defined(HAVE_LINUX_FUTEX_H)
+ if (syscall(SYS_futex, fut, FUTEX_WAIT, value, timeout, 0, 0) == 0)
+ return 0;
+#elif defined(HAVE_SYS_FUTEX_H)
+ if ((errno = futex(fut, FUTEX_WAIT, (int) value, timeout, NULL)) == 0)
+ return 0;
+ if (errno == ECANCELED)
+ errno = EINTR;
+#elif defined(HAVE_SYS_UMTX_H)
+ if (_umtx_op(fut, UMTX_OP_WAIT_UINT, value, 0, timeout) == 0)
+ return 0;
+#elif defined (HAVE___ULOCK_WAIT)
+ if (__ulock_wait(UL_COMPARE_AND_WAIT_SHARED,
+ (void *) fut,
+ value,
+ timeout ? timeout->tv_sec * 1000000 + timeout->tv_nsec / 1000 : 0) >= 0)
+ return 0;
+#else
+ /*
+ * If we wanted to simulate futexes on systems that don't have them, here
+ * we could add a link from our PGPROC struct to a shared memory hash
+ * table using "fut" (ie address) as the key, then compare *fut == value.
+ * If false, remove link and fail with EAGAIN. If so, sleep on proc latch.
+ * The main complication is that it wouldn't work for DSM segments; for
+ * those, we could have variants that take a dsm_segment and pointer and
+ * convert that to segment key + offset.
+ */
+ errno = ENOSYS;
+#endif
+
+ Assert(errno != 0);
+
+ return -1;
+}
+
+/*
+ * Wake up to nwaiters waiters that currently wait on the same address as
+ * 'fut'. Returns 0 on success, and -1 on failure, with errno set. Though
+ * some of these interfaces can tell us how many were woken, they can't all do
+ * that, so we'll hide that information.
+ */
+static int
+pg_futex_wake(volatile void *fut, int nwaiters)
+{
+#if defined(HAVE_LINUX_FUTEX_H)
+ if (syscall(SYS_futex, fut, FUTEX_WAKE, nwaiters, NULL, 0, 0) >= 0)
+ return 0;
+#elif defined(HAVE_SYS_FUTEX_H)
+ if (futex(fut, FUTEX_WAKE, nwaiters, NULL, NULL) >= 0)
+ return 0;
+#elif defined(HAVE_SYS_UMTX_H)
+ if (_umtx_op(fut, UMTX_OP_WAKE, nwaiters, 0, 0) == 0)
+ return 0;
+#elif defined (HAVE___ULOCK_WAIT)
+ if (__ulock_wake(UL_COMPARE_AND_WAIT_SHARED | (nwaiters > 1 ? ULF_WAKE_ALL : 0),
+ (void *) fut,
+ 0) >= 0)
+ return 0;
+ if (errno == ENOENT)
+ return 0;
+#else
+ /*
+ * If we wanted to simulate futexes on systems that don't have them, here
+ * we could look up "fut" in a shmem hash table set and set latches for
+ * any matches.
+ */
+ errno = ENOSYS;
+#endif
+
+ Assert(errno != 0);
+
+ return -1;
+}
+
+#endif
--
2.30.2
0002-Add-futex-based-semaphore-replacement.patchtext/x-patch; charset=US-ASCII; name=0002-Add-futex-based-semaphore-replacement.patchDownload
From bfe33b34dd6368c5c874d2777c045899272e7ff8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH 2/2] Add futex-based semaphore replacement.
Provide a drop-in replacement for POSIX unnamed semaphores using
futexes. Select with PREFERRED_SEMAPHORES=FUTEX. Perhaps useful on
OSes that lack shmem unnamed semaphores but have a futex facility.
XXX POC code only!
---
configure | 16 ++++-
configure.ac | 16 ++++-
src/backend/port/posix_sema.c | 106 +++++++++++++++++++++++++++++++++-
src/include/pg_config.h.in | 3 +
4 files changed, 134 insertions(+), 7 deletions(-)
diff --git a/configure b/configure
index 23f1cbe9d0..4e57c6436b 100755
--- a/configure
+++ b/configure
@@ -18486,6 +18486,10 @@ if test "$ac_res" != no; then :
fi
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"FUTEX" ; then
+ # Need futex implementation for this
+ USE_FUTEX_SEMAPHORES=1
+ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking which semaphore API to use" >&5
$as_echo_n "checking which semaphore API to use... " >&6; }
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
@@ -18502,11 +18506,19 @@ $as_echo "#define USE_UNNAMED_POSIX_SEMAPHORES 1" >>confdefs.h
SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
sematype="unnamed POSIX"
else
+ if test x"$USE_FUTEX_SEMAPHORES" = x"1" ; then
+
+$as_echo "#define USE_FUTEX_SEMAPHORES 1" >>confdefs.h
+
+ SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+ sematype="futex"
+ else
$as_echo "#define USE_SYSV_SEMAPHORES 1" >>confdefs.h
- SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
- sematype="System V"
+ SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
+ sematype="System V"
+ fi
fi
fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $sematype" >&5
diff --git a/configure.ac b/configure.ac
index 542b46437e..149ee401de 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2217,6 +2217,10 @@ if test "$PORTNAME" != "win32"; then
# Need sem_init for this
AC_SEARCH_LIBS(sem_init, [rt pthread], [USE_UNNAMED_POSIX_SEMAPHORES=1])
fi
+ if test x"$PREFERRED_SEMAPHORES" = x"FUTEX" ; then
+ # Need futex implementation for this
+ USE_FUTEX_SEMAPHORES=1
+ fi
AC_MSG_CHECKING([which semaphore API to use])
if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
AC_DEFINE(USE_NAMED_POSIX_SEMAPHORES, 1, [Define to select named POSIX semaphores.])
@@ -2228,9 +2232,15 @@ if test "$PORTNAME" != "win32"; then
SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
sematype="unnamed POSIX"
else
- AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
- SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
- sematype="System V"
+ if test x"$USE_FUTEX_SEMAPHORES" = x"1" ; then
+ AC_DEFINE(USE_FUTEX_SEMAPHORES, 1, [Define to select futex semaphores.])
+ SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+ sematype="futex"
+ else
+ AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
+ SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
+ sematype="System V"
+ fi
fi
fi
AC_MSG_RESULT([$sematype])
diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 114da3b30c..b0c65d8716 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -36,6 +36,10 @@
#include "storage/pg_sema.h"
#include "storage/shmem.h"
+#if defined(USE_FUTEX_SEMAPHORES)
+#include "port/atomics.h"
+#include "port/pg_futex.h"
+#endif
/* see file header comment */
#if defined(USE_NAMED_POSIX_SEMAPHORES) && defined(EXEC_BACKEND)
@@ -45,6 +49,9 @@
typedef union SemTPadded
{
sem_t pgsem;
+#if defined(USE_FUTEX_SEMAPHORES)
+ pg_atomic_uint32 futexsem;
+#endif
char pad[PG_CACHE_LINE_SIZE];
} SemTPadded;
@@ -70,6 +77,78 @@ static int nextSemKey; /* next name to try */
static void ReleaseSemaphores(int status, Datum arg);
+#ifdef USE_FUTEX_SEMAPHORES
+
+/*
+ * An implementation of POSIX unnamed semaphores in shared memory, for OSes
+ * that lack them but have futexes.
+ */
+
+static void
+pg_futex_sem_init(pg_atomic_uint32 *fut, uint32 value)
+{
+ pg_atomic_init_u32(fut, value);
+}
+
+static int
+pg_futex_sem_wait(pg_atomic_uint32 *fut)
+{
+ /*
+ * This will only work if our atomic types are wrappers around raw
+ * integers, so that the kernel can test the value at that address.
+ */
+ StaticAssertStmt(sizeof(*fut) == sizeof(uint32),
+ "cannot use emulated atomics with futexes");
+
+ for (;;)
+ {
+ uint32 old_value = pg_atomic_read_u32(fut);
+ if (old_value == 0)
+ {
+ /* Wait for someone else to move it above 0. */
+ if (pg_futex_wait_u32(fut, 0, NULL) < 0)
+ {
+ if (errno != EAGAIN)
+ return -1;
+ /* The value changed under our feet. Try again. */
+ }
+ }
+ else
+ {
+ /* Try to decrement it. */
+ if (pg_atomic_compare_exchange_u32(fut, &old_value, old_value - 1))
+ return 0; /* success */
+ }
+ }
+
+ pg_unreachable();
+}
+
+static int
+pg_futex_sem_post(pg_atomic_uint32 *fut)
+{
+ for (;;)
+ {
+ uint32 old_value = pg_atomic_read_u32(fut);
+
+ if (pg_atomic_compare_exchange_u32(fut, &old_value, old_value + 1))
+ {
+ /*
+ * XXX TODO: encode nwaiters into value, so we can suppress useless
+ * wake calls.
+ */
+ if (pg_futex_wake(fut, INT_MAX) < 0)
+ {
+ /* Undo value change? */
+ return -1;
+ }
+ break;
+ }
+ }
+ return 0;
+}
+
+#endif
#ifdef USE_NAMED_POSIX_SEMAPHORES
@@ -124,7 +203,7 @@ PosixSemaphoreCreate(void)
return mySem;
}
-#else /* !USE_NAMED_POSIX_SEMAPHORES */
+#elif defined(USE_UNNAMED_POSIX_SEMAPHORES)
/*
* PosixSemaphoreCreate
@@ -139,6 +218,7 @@ PosixSemaphoreCreate(sem_t *sem)
}
#endif /* USE_NAMED_POSIX_SEMAPHORES */
+#ifndef USE_FUTEX_SEMAPHORES
/*
* PosixSemaphoreKill - removes a semaphore
@@ -156,6 +236,7 @@ PosixSemaphoreKill(sem_t *sem)
elog(LOG, "sem_destroy failed: %m");
#endif
}
+#endif
/*
@@ -239,18 +320,22 @@ PGReserveSemaphores(int maxSemas)
static void
ReleaseSemaphores(int status, Datum arg)
{
+#ifdef USE_NAMED_POSIX_SEMAPHORES
int i;
-#ifdef USE_NAMED_POSIX_SEMAPHORES
for (i = 0; i < numSems; i++)
PosixSemaphoreKill(mySemPointers[i]);
free(mySemPointers);
#endif
#ifdef USE_UNNAMED_POSIX_SEMAPHORES
+ int i;
+
for (i = 0; i < numSems; i++)
PosixSemaphoreKill(PG_SEM_REF(sharedSemas + i));
#endif
+
+ /* Futex-based semaphores have no kernel resource to clean up. */
}
/*
@@ -262,7 +347,9 @@ PGSemaphore
PGSemaphoreCreate(void)
{
PGSemaphore sema;
+#ifndef USE_FUTEX_SEMAPHORES
sem_t *newsem;
+#endif
/* Can't do this in a backend, because static state is postmaster's */
Assert(!IsUnderPostmaster);
@@ -275,6 +362,9 @@ PGSemaphoreCreate(void)
/* Remember new sema for ReleaseSemaphores */
mySemPointers[numSems] = newsem;
sema = (PGSemaphore) newsem;
+#elif defined(USE_FUTEX_SEMAPHORES)
+ sema = &sharedSemas[numSems];
+ pg_futex_sem_init(&sema->sem_padded.futexsem, 1);
#else
sema = &sharedSemas[numSems];
newsem = PG_SEM_REF(sema);
@@ -294,6 +384,9 @@ PGSemaphoreCreate(void)
void
PGSemaphoreReset(PGSemaphore sema)
{
+#ifdef USE_FUTEX_SEMAPHORES
+ pg_atomic_write_u32(&sema->sem_padded.futexsem, 0);
+#else
/*
* There's no direct API for this in POSIX, so we have to ratchet the
* semaphore down to 0 with repeated trywait's.
@@ -309,6 +402,7 @@ PGSemaphoreReset(PGSemaphore sema)
elog(FATAL, "sem_trywait failed: %m");
}
}
+#endif
}
/*
@@ -324,7 +418,11 @@ PGSemaphoreLock(PGSemaphore sema)
/* See notes in sysv_sema.c's implementation of PGSemaphoreLock. */
do
{
+#if defined(USE_FUTEX_SEMAPHORES)
+ errStatus = pg_futex_sem_wait(&sema->sem_padded.futexsem);
+#else
errStatus = sem_wait(PG_SEM_REF(sema));
+#endif
} while (errStatus < 0 && errno == EINTR);
if (errStatus < 0)
@@ -349,7 +447,11 @@ PGSemaphoreUnlock(PGSemaphore sema)
*/
do
{
+#if defined(USE_FUTEX_SEMAPHORES)
+ errStatus = pg_futex_sem_post(&sema->sem_padded.futexsem);
+#else
errStatus = sem_post(PG_SEM_REF(sema));
+#endif
} while (errStatus < 0 && errno == EINTR);
if (errStatus < 0)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6bd2f7b5d8..b365d6ced3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -916,6 +916,9 @@
/* Define to 1 to build with BSD Authentication support. (--with-bsd-auth) */
#undef USE_BSD_AUTH
+/* Define to select futex semaphores. */
+#undef USE_FUTEX_SEMAPHORES
+
/* Define to build with ICU support. (--with-icu) */
#undef USE_ICU
--
2.30.2
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.
I think I'll drop the diffs, you're right that current proven logic need
not to be changed for such rare corner case, which DBA can fix.
I've added references to ipcs(1) and ipcrm(1) in OpenBSD's semget(2) man
page, so newcomer won't need to spend hours digging in sysv semas
management, if one would encounter the same situation as I did.
Thanks for reviews.
On Sun, Oct 24, 2021 at 10:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Sadly, although the attached proof-of-concept patch allows a
PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
lacks native unnamed semas), FreeBSD and Linux (which don't need this
but are interesting to test), and it also works on OpenBSD with
shared_memory_type=sysv, it doesn't work on OpenBSD with
shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
has a bug: inherited anonymous shared mmap memory seems to confuse it
so that wakeups are lost. Arrrgh!
FWIW I'm trying to follow up with the OpenBSD list over here, because
it'd be nice to get that working:
https://marc.info/?l=openbsd-misc&m=163524454303022&w=2
On Fri, Oct 29, 2021 at 4:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sun, Oct 24, 2021 at 10:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Sadly, although the attached proof-of-concept patch allows a
PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
lacks native unnamed semas), FreeBSD and Linux (which don't need this
but are interesting to test), and it also works on OpenBSD with
shared_memory_type=sysv, it doesn't work on OpenBSD with
shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
has a bug: inherited anonymous shared mmap memory seems to confuse it
so that wakeups are lost. Arrrgh!FWIW I'm trying to follow up with the OpenBSD list over here, because
it'd be nice to get that working:https://marc.info/?l=openbsd-misc&m=163524454303022&w=2
This has been fixed. So now there are working basic futexes on Linux,
macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
wait/wake vary, as do APIs). So the question is whether it would be
worth trying to do our own futex-based semaphores, as sketched above,
just for the benefit of the OSes where the available built-in
semaphores are of the awkward SysV kind, namely macOS, NetBSD and
OpenBSD. Perhaps we shouldn't waste our time with that, and should
instead plan to use futexes for a more ambitious lwlock rewrite.
Thomas Munro <thomas.munro@gmail.com> writes:
This has been fixed. So now there are working basic futexes on Linux,
macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
wait/wake vary, as do APIs). So the question is whether it would be
worth trying to do our own futex-based semaphores, as sketched above,
just for the benefit of the OSes where the available built-in
semaphores are of the awkward SysV kind, namely macOS, NetBSD and
OpenBSD. Perhaps we shouldn't waste our time with that, and should
instead plan to use futexes for a more ambitious lwlock rewrite.
I kind of like the latter idea, but I wonder how we make it coexist
with (admittedly legacy) code for OSes that don't have usable futexes.
regards, tom lane
On Sat, Nov 20, 2021 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
This has been fixed. So now there are working basic futexes on Linux,
macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
wait/wake vary, as do APIs). So the question is whether it would be
worth trying to do our own futex-based semaphores, as sketched above,
just for the benefit of the OSes where the available built-in
semaphores are of the awkward SysV kind, namely macOS, NetBSD and
OpenBSD. Perhaps we shouldn't waste our time with that, and should
instead plan to use futexes for a more ambitious lwlock rewrite.I kind of like the latter idea, but I wonder how we make it coexist
with (admittedly legacy) code for OSes that don't have usable futexes.
One very rough idea, not yet tried, is that they could keep using
semaphores, but use them to implement fake futexes. We'd put them in
wait lists that live in a shared memory hash table (the futex address
is the key, with some extra work needed for DSM-resident futexes),
with per-bucket spinlocks so that you can perform the value check
atomically with the decision to start waiting.