Regression tests fail on OpenBSD due to low semmns value

Started by Alexander Lakhin about 1 year ago, 17 messages
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

A recent buildfarm timeout failure on sawshark [1] made me wonder what's
wrong with that animal. Besides that failure, this animal (running on
OpenBSD 7.4) has produced "too many clients" errors from time to time, e.g.,
[2].

I deployed OpenBSD 7.4 locally and reproduced "too many clients" and that
hang as well. It turned out that OpenBSD has semmns as low as 60 (see [4])
and as a consequence, initdb sets max_connections = 20 for the regression
test database. (This can be helpful sometimes, see e.g., [5].) At the same
time, parallel_schedule contains groups of 20 tests, for instance:
# parallel group (20 tests):  select_into random delete select_having select_distinct_on case prepared_xacts namespace
select_implicit union arrays portals transactions select_distinct subselect update join aggregates hash_index btree_index

Moreover, prepared_xacts performs "\c", and it adds one more connection
for a short time, according to postmaster.log:
2024-12-16 06:18:20.290 EET [regression][1563560:91][client backend] [pg_regress/prepared_xacts] LOG:  statement: rollback;
...
2024-12-16 06:18:20.290 EET [regression][1563561:2][client backend] [[unknown]] FATAL:  sorry, too many clients already
...
2024-12-16 06:18:20.291 EET [regression][1563560:95][client backend] [pg_regress/prepared_xacts] LOG:  disconnection:
session time: 0:00:00.018 user=law database=regression host=[local]

sysctl kern.seminfo.semmns=120 makes the issue go away on this OS;
on the other hand, "too many clients" failures can be reproduced on other
OSes with "max_connections=20" in TEMP_CONFIG.

As to the hang, it can be reproduced easily with:
TEMP_CONFIG containing
max_connections=2
superuser_reserved_connections=0

and parallel_schedule as simple as:
test: transactions prepared_xacts
test: transactions prepared_xacts

Running `TEMP_CONFIG=.../extra.config make -s check`, I can see:
# +++ regress check in src/test/regress +++
...
# parallel group (2 tests):  prepared_xacts transactions
not ok 1     + transactions                               56 ms
not ok 2     + prepared_xacts                             21 ms
# (test process exited with exit code 2)
# parallel group (2 tests):
### the test is hanging here ###

with one backend waiting inside:
#0  0x000070c41ed2a007 in epoll_wait (epfd=6, events=0x629f1ce529e8, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000629f1410d64a in WaitEventSetWaitBlock (set=0x629f1ce52980, cur_timeout=-1, occurred_events=0x7ffd4c4ffed0,
nevents=1) at latch.c:1564
#2  0x0000629f1410d534 in WaitEventSetWait (set=0x629f1ce52980, timeout=-1, occurred_events=0x7ffd4c4ffed0, nevents=1,
wait_event_info=134217779) at latch.c:1510
#3  0x0000629f1410c764 in WaitLatch (latch=0x70c41b86bc24, wakeEvents=33, timeout=0, wait_event_info=134217779) at
latch.c:538
#4  0x0000629f1413d032 in ProcWaitForSignal (wait_event_info=134217779) at proc.c:1893
#5  0x0000629f14132eb9 in GetSafeSnapshot (origSnapshot=0x629f147ad360 <CurrentSnapshotData>) at predicate.c:1579
#6  0x0000629f14133261 in GetSerializableTransactionSnapshot (snapshot=0x629f147ad360 <CurrentSnapshotData>) at
predicate.c:1695
#7  0x0000629f143afafe in GetTransactionSnapshot () at snapmgr.c:253
#8  0x0000629f1414a7b8 in exec_simple_query (query_string=0x629f1ce580f0 "SELECT * FROM writetest;") at postgres.c:1172
...

So GetSafeSnapshot() waits indefinitely for possibleUnsafeConflicts to
become empty (for another backend to remove itself from the list of possible
conflicts inside ReleasePredicateLocks()), but that never happens.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sawshark&dt=2024-12-11%2012%3A20%3A05
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sawshark&dt=2024-07-22%2001%3A20%3A22
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sawshark&dt=2024-11-25%2006%3A20%3A22
[4]: https://man.openbsd.org/options
[5]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=73c9f91a1

Best regards,
Alexander

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#1)
Re: Regression tests fail on OpenBSD due to low semmns value

Alexander Lakhin <exclusion@gmail.com> writes:

I deployed OpenBSD 7.4 locally and reproduced "too many clients" and that
hang as well. It turned out that OpenBSD has semmns as low as 60 (see [4])
and as a consequence, initdb sets max_connections = 20 for the regression
test database. (This can be helpful sometimes, see e.g., [5].) At the same
time, parallel_schedule contains groups of 20 tests, for instance:

Yeah. That was more-or-less okay before we invented parallel query,
but now there needs to be some headroom. I've thought about adjusting
initdb to not allow max_connections less than 25 (can't remember if
I actually proposed that on-list though). The other way would be to
rearrange parallel_schedule to make the max group size less than 20,
but that seems like a lot of effort for little benefit.

FTR, NetBSD also has unreasonably tiny semaphore settings out of the
box. mamba's host is using

kern.ipc.semmni=100
kern.ipc.semmns=1000

and for that matter

kern.maxvnodes=60000
kern.maxproc=1000
kern.maxfiles=10000

...
So GetSafeSnapshot() waits indefinitely for possibleUnsafeConflicts to
become empty (for another backend to remove itself from the list of possible
conflicts inside ReleasePredicateLocks()), but that never happens.

This seems like an actual bug?

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: Regression tests fail on OpenBSD due to low semmns value

On 2024-12-16 Mo 12:23 AM, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

I deployed OpenBSD 7.4 locally and reproduced "too many clients" and that
hang as well. It turned out that OpenBSD has semmns as low as 60 (see [4])
and as a consequence, initdb sets max_connections = 20 for the regression
test database. (This can be helpful sometimes, see e.g., [5].) At the same
time, parallel_schedule contains groups of 20 tests, for instance:

Yeah. That was more-or-less okay before we invented parallel query,
but now there needs to be some headroom. I've thought about adjusting
initdb to not allow max_connections less than 25 (can't remember if
I actually proposed that on-list though). The other way would be to
rearrange parallel_schedule to make the max group size less than 20,
but that seems like a lot of effort for little benefit.

25 seems perfectly reasonable, these days. The current minimum was set
nearly 7 years ago.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: Regression tests fail on OpenBSD due to low semmns value

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-12-16 Mo 12:23 AM, Tom Lane wrote:

Yeah. That was more-or-less okay before we invented parallel query,
but now there needs to be some headroom. I've thought about adjusting
initdb to not allow max_connections less than 25 (can't remember if
I actually proposed that on-list though). The other way would be to
rearrange parallel_schedule to make the max group size less than 20,
but that seems like a lot of effort for little benefit.

25 seems perfectly reasonable, these days. The current minimum was set
nearly 7 years ago.

I poked at this a bit on an OpenBSD installation. The out-of-the-box
value of kern.seminfo.semmns seems to be 60, as Alexander said.
It turns out that we can run under that with max_connections = 20,
but not any higher value, the reason being that the number of
semaphores we need is

MaxConnections +
autovacuum_max_workers + 1 +
max_worker_processes +
max_wal_senders +
NUM_AUXILIARY_PROCS

or 20 + 3 + 1 + 8 + 10 + 6 = 48. We allocate semaphores in groups
of SEMAS_PER_SET (16), plus one for identification purposes,
so with this many semaphores needed we create 3 sets of 17 semaphores
= 51 semaphores. One more requested semaphore would put us up to 68
semaphores which is more than OpenBSD's SEMMNS. So we're already on
the hairy edge here.
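
The set arithmetic in the paragraph above can be checked with a few lines of C. This is only a sketch that mirrors the numbers in the message, not the code in sysv_sema.c; the function name sysv_semas_consumed is hypothetical, and the assumption (taken from the text) is that every set costs semas_per_set usable semaphores plus one identification semaphore.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of PostgreSQL: return how many SysV
 * semaphores the kernel must supply so that 'needed' usable semaphores
 * are available, assuming each set holds semas_per_set usable semaphores
 * plus one extra used for identification.
 */
static int
sysv_semas_consumed(int needed, int semas_per_set)
{
	int			nsets;

	assert(needed >= 0 && semas_per_set > 0);

	/* Sets are allocated whole, so round the set count up. */
	nsets = (needed + semas_per_set - 1) / semas_per_set;

	return nsets * (semas_per_set + 1);
}
```

With the values from the message, sysv_semas_consumed(48, 16) gives 51 and sysv_semas_consumed(49, 16) gives 68, which is why a single extra process pushes the total past a SEMMNS of 60.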

Now we could just blow this off and say that we can't run on OpenBSD
at all without an increase in kern.seminfo.semmns. But that seems a
little sad, because there are easy things we could do to make this
less tight:

* Why in the world is the default value of max_wal_senders 10?
I find it hard to believe that there are installations using
more than about 3, and even there you can bet they are changing
a lot of other parameters.

* There's no reason that SEMAS_PER_SET has to be a power of 2. The
commentary in sysv_sema.c says "It must be *less than* your kernel's
SEMMSL (max semaphores per set) parameter, which is often around 25".
If we made it, say, 19, then we could allocate 3 sets (really 20
semaphores each, 60 in total) and accommodate up to 57 processes
without needing an increase in kern.seminfo.semmns.
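
To see how much room a different SEMAS_PER_SET leaves under a fixed SEMMNS, the calculation can be turned around. The sketch below is an illustration under stated assumptions: the function name max_connections_that_fit is hypothetical, other_procs stands for the non-connection terms of the formula quoted earlier (3 + 1 + 8 + 10 + 6 = 28 with the defaults mentioned in this thread), and kernel limits other than SEMMNS, such as SEMMNI or SEMMSL, are ignored.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of PostgreSQL: largest max_connections
 * that fits under a kernel SEMMNS limit, assuming each semaphore set
 * costs semas_per_set + 1 kernel semaphores and that other_procs
 * semaphores are reserved for non-connection processes.
 */
static int
max_connections_that_fit(int semmns, int semas_per_set, int other_procs)
{
	int			nsets;
	int			usable;

	assert(semmns >= 0 && semas_per_set > 0 && other_procs >= 0);

	/* Only whole sets can be allocated within the limit. */
	nsets = semmns / (semas_per_set + 1);
	usable = nsets * semas_per_set;

	return usable > other_procs ? usable - other_procs : 0;
}
```

Under these assumptions, a SEMMNS of 60 with 28 other processes leaves room for 20 connections when SEMAS_PER_SET is 16 and for 29 when it is 19. This is an upper bound from the arithmetic alone; the value initdb actually settles on depends on which settings it probes.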

In short then, I propose:

* Increase initdb's minimum probed max_connections to 25.

* Reduce default value of max_wal_senders to 3 (or maybe 5
if people think that's too drastic).

* Change sysv_sema.c's SEMAS_PER_SET to 19.

On a stock OpenBSD setup, I find that this actually lets
us set max_connections to 30, so that there's some headroom
for the inevitable future growth of the number of background
processes.

Of course, none of this is going to save owners of *BSD
buildfarm animals from needing to increase the kernel
parameters, because the regression tests launch multiple
postmasters in places. But I think it's friendly to novice
PG users if they can launch one postmaster without that.

regards, tom lane

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: Regression tests fail on OpenBSD due to low semmns value

Hi,

On 2024-12-16 12:52:46 -0500, Tom Lane wrote:

or 20 + 3 + 1 + 8 + 10 + 6 = 48. We allocate semaphores in groups
of SEMAS_PER_SET (16), plus one for identification purposes,
so with this many semaphores needed we create 3 sets of 17 semaphores
= 51 semaphores. One more requested semaphore would put us up to 68
semaphores which is more than OpenBSD's SEMMNS. So we're already on
the hairy edge here.

Now we could just blow this off and say that we can't run on OpenBSD
at all without an increase in kern.seminfo.semmns.

Given the number of users (or even testers) on OpenBSD that seems like it
might be a reasonable answer... But, see below.

* Why in the world is the default value of max_wal_senders 10?
I find it hard to believe that there are installations using
more than about 3, and even there you can bet they are changing
a lot of other parameters.

I don't think it's that rare as logical replication also needs a walsender
slot... I think we're going to hurt far more users by lowering this than we'd
help.

But I think it might be sane to have initdb probe a lower max_wal_senders
alongside lower max_connections settings. It seems to make sense to have a
lower max_wal_senders setting on machines that don't have enough resources to
run with max_connections=100.

Greetings,

Andres Freund

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: Regression tests fail on OpenBSD due to low semmns value

Andres Freund <andres@anarazel.de> writes:

On 2024-12-16 12:52:46 -0500, Tom Lane wrote:

* Why in the world is the default value of max_wal_senders 10?
I find it hard to believe that there are installations using
more than about 3, and even there you can bet they are changing
a lot of other parameters.

I don't think it's that rare as logical replication also needs a walsender
slot... I think we're going to hurt far more users by lowering this than we'd
help.

Hm, okay. If we just twiddle SEMAS_PER_SET we can still have
max_connections = 25 with max_wal_senders = 10, so doing that
much seems free.

regards, tom lane

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#1)
3 attachment(s)
Re: Regression tests fail on OpenBSD due to low semmns value

On Mon, Dec 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

It turned out that OpenBSD has semmns as low as 60 (see [4])

Whenever I run into this, or my Mac requires manual ipcrm to clean up
leaked SysV kernel junk, I rebase my patch for sema_kind = 'futex'.
Here it goes. It could be updated to support NetBSD I believe, but I
didn't try as its futex stuff came out later.

Then I remember why I didn't go anywhere with it. It triggers a
thought loop about flipping it all around: use futexes to implement
lwlocks directly in place, and get rid of semaphores completely, but
that involves a few rabbit holes and sub-projects. From memory:
classic r/w lock implementation on futexes is tricky but doable in the
portability constraints, futex fallback implementation even works
surprisingly well but has fun memory map sub-problems, actually lwlock
is not really a classic r/w lock as it has sprouted extra funky APIs
that lead the intrepid rabbit-holer to design an entirely different
new concurrency primitive that is really wanted for those users, a
couple of other places use raw semaphores directly namely procarray.c
and clog.c and if you stare at those for long you will be overwhelmed
with a desire to rewrite them, EOVERFLOW.
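
For readers who want the gist of a futex-backed semaphore without applying the attached patches, here is a minimal standalone sketch. It is Linux-only, uses C11 atomics rather than PostgreSQL's pg_atomic_uint32, and the demo_sem names are hypothetical. Unlike the patch, it retries on EINTR internally instead of leaving that to the caller, so treat it as an illustration of the technique rather than a copy of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The kernel compares a 32-bit word, so the atomic must be exactly that. */
static_assert(sizeof(_Atomic uint32_t) == sizeof(uint32_t),
			  "futex word must be 32 bits");

/* Hypothetical type; the counter may live in shared memory. */
typedef struct
{
	_Atomic uint32_t value;
} demo_sem;

static void
demo_sem_init(demo_sem *s, uint32_t initial)
{
	atomic_init(&s->value, initial);
}

/* Increment the count, then wake sleepers so they can retry. */
static int
demo_sem_post(demo_sem *s)
{
	atomic_fetch_add(&s->value, 1);
	if (syscall(SYS_futex, &s->value, FUTEX_WAKE, INT_MAX, NULL, NULL, 0) < 0)
		return -1;
	return 0;
}

/* Decrement the count, sleeping in the kernel while it is zero. */
static int
demo_sem_wait(demo_sem *s)
{
	uint32_t	expected = atomic_load(&s->value);

	for (;;)
	{
		if (expected > 0)
		{
			/* On failure, 'expected' is refreshed with the current value. */
			if (atomic_compare_exchange_weak(&s->value, &expected,
											 expected - 1))
				return 0;
			continue;
		}

		/* Sleep only if the value is still 0; EAGAIN means it changed. */
		if (syscall(SYS_futex, &s->value, FUTEX_WAIT, 0, NULL, NULL, 0) < 0 &&
			errno != EAGAIN && errno != EINTR)
			return -1;

		expected = atomic_load(&s->value);
	}
}
```

A semaphore initialized to 1 can be taken once without blocking; a second demo_sem_wait() would sleep until another thread or process calls demo_sem_post(). No kernel object is created, which is the property that sidesteps the semmns limit discussed in this thread.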

Attachments:

0001-A-basic-API-for-futexes.patch (application/x-patch)
From 42054d64062da58e44a383d0ed0c1c6bb2ba88e1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH 1/3] A basic API for futexes.

A thin wrapper for basic 32 bit futex wait and wake.  Currently, it maps
to native support on Linux, DragonFlyBSD, FreeBSD, OpenBSD and macOS,
with detection via configure/meson.

NetBSD could probably be added, not investigated.  Windows'
WaitOnAddress() can't because it only works between threads.  A
latch-based backend-only fallback implementation is plausible.
---
 configure                    |   4 +-
 configure.ac                 |   5 +
 meson.build                  |   5 +
 src/backend/port/meson.build |   2 +-
 src/include/pg_config.h.in   |  15 +++
 src/include/port/pg_futex.h  | 171 +++++++++++++++++++++++++++++++++++
 6 files changed, 199 insertions(+), 3 deletions(-)
 create mode 100644 src/include/port/pg_futex.h

diff --git a/configure b/configure
index 518c33b73a9..6eb25178dab 100755
--- a/configure
+++ b/configure
@@ -13227,7 +13227,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h termios.h ucred.h xlocale.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h linux/futex.h mbarrier.h sys/epoll.h sys/event.h sys/futex.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h sys/umtx.h termios.h ucred.h xlocale.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
@@ -15044,7 +15044,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile copy_file_range elf_aux_info getauxval getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in __ulock_wait backtrace_symbols copyfile copy_file_range elf_aux_info getauxval getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range umtx_sleep uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 247ae97fa4c..6b4f3e0f2e5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1438,14 +1438,17 @@ AC_CHECK_HEADERS(m4_normalize([
 	execinfo.h
 	getopt.h
 	ifaddrs.h
+	linux/futex.h
 	mbarrier.h
 	sys/epoll.h
 	sys/event.h
+	sys/futex.h
 	sys/personality.h
 	sys/prctl.h
 	sys/procctl.h
 	sys/signalfd.h
 	sys/ucred.h
+	sys/umtx.h
 	termios.h
 	ucred.h
 	xlocale.h
@@ -1707,6 +1710,7 @@ LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
 AC_CHECK_FUNCS(m4_normalize([
+	__ulock_wait
 	backtrace_symbols
 	copyfile
 	copy_file_range
@@ -1727,6 +1731,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strsignal
 	syncfs
 	sync_file_range
+	umtx_sleep
 	uselocale
 	wcstombs_l
 ]))
diff --git a/meson.build b/meson.build
index e5ce437a5c7..5c9775f1a6e 100644
--- a/meson.build
+++ b/meson.build
@@ -2380,15 +2380,18 @@ header_checks = [
   'execinfo.h',
   'getopt.h',
   'ifaddrs.h',
+  'linux/futex.h',
   'mbarrier.h',
   'strings.h',
   'sys/epoll.h',
   'sys/event.h',
+  'sys/futex.h',
   'sys/personality.h',
   'sys/prctl.h',
   'sys/procctl.h',
   'sys/signalfd.h',
   'sys/ucred.h',
+  'sys/umtx.h',
   'termios.h',
   'ucred.h',
   'xlocale.h',
@@ -2611,6 +2614,7 @@ endif
 # XXX: Might be worth conditioning some checks on the OS, to avoid doing
 # unnecessary checks over and over, particularly on windows.
 func_checks = [
+  ['__ulock_wait'],
   ['backtrace_symbols', {'dependencies': [execinfo_dep]}],
   ['clock_gettime', {'dependencies': [rt_dep], 'define': false}],
   ['copyfile'],
@@ -2654,6 +2658,7 @@ func_checks = [
   ['strsignal'],
   ['sync_file_range'],
   ['syncfs'],
+  ['umtx_sleep'],
   ['uselocale'],
   ['wcstombs_l'],
 ]
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 7820e86016d..e34499bafb3 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -5,7 +5,7 @@ backend_sources += files(
 )
 
 
-if cdata.has('USE_UNNAMED_POSIX_SEMAPHORES') or cdata.has('USE_NAMED_POSIX_SEMAPHORES')
+if cdata.has('USE_UNNAMED_POSIX_SEMAPHORES') or cdata.has('USE_NAMED_POSIX_SEMAPHORES') or cdata.has('USE_FUTEX_SEMAPHORES')
   backend_sources += files('posix_sema.c')
 endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..19cbf6e74ee 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -265,6 +265,9 @@
 /* Define to 1 if you have the `zstd' library (-lzstd). */
 #undef HAVE_LIBZSTD
 
+/* Define to 1 if you have the <linux/futex.h> header file. */
+#undef HAVE_LINUX_FUTEX_H
+
 /* Define to 1 if you have the <mbarrier.h> header file. */
 #undef HAVE_MBARRIER_H
 
@@ -418,6 +421,9 @@
 /* Define to 1 if you have the <sys/event.h> header file. */
 #undef HAVE_SYS_EVENT_H
 
+/* Define to 1 if you have the <sys/futex.h> header file. */
+#undef HAVE_SYS_FUTEX_H
+
 /* Define to 1 if you have the <sys/personality.h> header file. */
 #undef HAVE_SYS_PERSONALITY_H
 
@@ -439,6 +445,9 @@
 /* Define to 1 if you have the <sys/ucred.h> header file. */
 #undef HAVE_SYS_UCRED_H
 
+/* Define to 1 if you have the <sys/umtx.h> header file. */
+#undef HAVE_SYS_UMTX_H
+
 /* Define to 1 if you have the <termios.h> header file. */
 #undef HAVE_TERMIOS_H
 
@@ -448,6 +457,9 @@
 /* Define to 1 if you have the <ucred.h> header file. */
 #undef HAVE_UCRED_H
 
+/* Define to 1 if you have the `umtx_sleep' function. */
+#undef HAVE_UMTX_SLEEP
+
 /* Define to 1 if the system has the type `union semun'. */
 #undef HAVE_UNION_SEMUN
 
@@ -538,6 +550,9 @@
 /* Define to 1 if your compiler understands _Static_assert. */
 #undef HAVE__STATIC_ASSERT
 
+/* Define to 1 if you have the `__ulock_wait' function. */
+#undef HAVE___ULOCK_WAIT
+
 /* Define as the maximum alignment requirement of any C data type. */
 #undef MAXIMUM_ALIGNOF
 
diff --git a/src/include/port/pg_futex.h b/src/include/port/pg_futex.h
new file mode 100644
index 00000000000..e5ae05d1d5a
--- /dev/null
+++ b/src/include/port/pg_futex.h
@@ -0,0 +1,171 @@
+/*
+ * Minimal wrapper over futex APIs.
+ */
+
+#ifndef PG_FUTEX_H
+#define PG_FUTEX_H
+
+#if defined(HAVE_LINUX_FUTEX_H)
+
+/* https://man7.org/linux/man-pages/man2/futex.2.html */
+
+#include <linux/futex.h>
+#include <sys/syscall.h>
+
+#elif defined(HAVE_SYS_FUTEX_H)
+
+/* https://man.openbsd.org/futex, since OpenBSD 6.2. */
+
+#include <sys/time.h>
+#include <sys/futex.h>
+
+#elif defined(HAVE_SYS_UMTX_H)
+
+/* https://www.freebsd.org/cgi/man.cgi?query=_umtx_op */
+
+#include <sys/types.h>
+#include <sys/umtx.h>
+
+#elif defined(HAVE_UMTX_SLEEP)
+
+/* https://man.dragonflybsd.org/?command=umtx&section=2 */
+
+#include <unistd.h>
+
+#elif defined(HAVE___ULOCK_WAIT)
+
+/*
+ * This interface is undocumented, but provided by libSystem.dylib since
+ * xnu-3789.1.32 (macOS 10.12, 2016) and is used by eg libc++.
+ *
+ * https://github.com/apple/darwin-xnu/blob/main/bsd/kern/sys_ulock.c
+ * https://github.com/apple/darwin-xnu/blob/main/bsd/sys/ulock.h
+ */
+
+#include <stdint.h>
+
+#define UL_COMPARE_AND_WAIT_SHARED		3
+#define ULF_WAKE_ALL					0x00000100
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+extern int	__ulock_wait(uint32_t operation,
+						 void *addr,
+						 uint64_t value,
+						 uint32_t timeout);
+extern int	__ulock_wake(uint32_t operation,
+						 void *addr,
+						 uint64_t wake_value);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+/*
+ * Wait for someone to call pg_futex_wake() for the same address, with an
+ * initial check that the value pointed to by 'fut' matches 'value' and an
+ * optional timeout.  Returns 0 when woken, and otherwise -1, with errno set to
+ * EAGAIN if the initial value check fails, and otherwise errors including
+ * EINTR, ETIMEDOUT and EFAULT.
+ */
+static int
+pg_futex_wait_u32(volatile void *fut,
+				  uint32 value,
+				  struct timespec *timeout)
+{
+#if defined(HAVE_LINUX_FUTEX_H)
+	if (syscall(SYS_futex, fut, FUTEX_WAIT, value, timeout, 0, 0) == 0)
+		return 0;
+#elif defined(HAVE_SYS_FUTEX_H)
+	if ((errno = futex((void *) fut, FUTEX_WAIT, (int) value, timeout, NULL)) == 0)
+		return 0;
+	if (errno == ECANCELED)
+		errno = EINTR;
+#elif defined(HAVE_SYS_UMTX_H)
+	if (_umtx_op((void *) fut, UMTX_OP_WAIT_UINT, value, 0, timeout) == 0)
+		return 0;
+#elif defined(HAVE_UMTX_SLEEP)
+	if (umtx_sleep((volatile const int *) fut,
+				   (int) value,
+				   timeout ? timeout->tv_sec * 1000000 + timeout->tv_nsec / 1000 : 0) == 0)
+		return 0;
+	if (errno == EBUSY)
+		errno = EAGAIN;
+#elif defined (HAVE___ULOCK_WAIT)
+	if (__ulock_wait(UL_COMPARE_AND_WAIT_SHARED,
+					 (void *) fut,
+					 value,
+					 timeout ? timeout->tv_sec * 1000000 + timeout->tv_nsec / 1000 : 0) >= 0)
+		return 0;
+#else
+	/*
+	 * If we wanted to simulate futexes on systems that don't have them, here
+	 * we could add a link from our PGPROC struct to a shared memory hash
+	 * table using "fut" (ie address) as the key, then compare *fut == value.
+	 * If false, remove link and fail with EAGAIN.  If true, sleep on proc
+	 * latch.  This wouldn't work for DSM segments; for those, we could search
+	 * for matching DSM segment mappings in this process, and convert the key
+	 * to { segment ID, offset }, just like kernels do internally to make
+	 * inter-process futexes work on shared memory, but... ugh.
+	 */
+	errno = ENOSYS;
+#endif
+
+	Assert(errno != 0);
+
+	return -1;
+}
+
+/*
+ * Wake up to nwaiters waiters that currently wait on the same address as
+ * 'fut'.  Returns 0 on success, and -1 on failure, with errno set.  Though
+ * some of these interfaces can tell us how many were woken, they can't all do
+ * that, so we'll hide that information.
+ */
+static int
+pg_futex_wake(volatile void *fut, int nwaiters)
+{
+#if defined(HAVE_LINUX_FUTEX_H)
+	if (syscall(SYS_futex, fut, FUTEX_WAKE, nwaiters, NULL, 0, 0) >= 0)
+		return 0;
+#elif defined(HAVE_SYS_FUTEX_H)
+	if (futex(fut, FUTEX_WAKE, nwaiters, NULL, NULL) >= 0)
+		return 0;
+#elif defined(HAVE_SYS_UMTX_H)
+	if (_umtx_op((void *) fut, UMTX_OP_WAKE, nwaiters, 0, 0) == 0)
+		return 0;
+#elif defined(HAVE_UMTX_SLEEP)
+	if (umtx_wakeup((volatile const int *) fut, nwaiters) == 0)
+		return 0;
+#elif defined (HAVE___ULOCK_WAIT)
+	if (__ulock_wake(UL_COMPARE_AND_WAIT_SHARED | (nwaiters > 1 ? ULF_WAKE_ALL : 0),
+					 (void *) fut,
+					 0) >= 0)
+		return 0;
+	if (errno == ENOENT)
+		return 0;
+#else
+	/* No implementation available. */
+	errno = ENOSYS;
+#endif
+
+	Assert(errno != 0);
+
+	return -1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif							/* PG_FUTEX_H */
-- 
2.47.1

0002-Add-futex-based-semaphore-replacement.patch (application/x-patch)
From 4614b62e6006f202a3a739175b73684e51c58914 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 24 Oct 2021 21:48:26 +1300
Subject: [PATCH 2/3] Add futex-based semaphore replacement.

Provide a drop-in replacement for POSIX unnamed semaphores using
futexes.  This is useful for systems that don't have unnamed semaphores
at all, or don't have unnamed semaphores that work inter-process.  This
should be more convenient because the alternatives require kernel
resources and configuration and can also leak in various scenarios.
---
 configure                     |  16 +++++-
 configure.ac                  |  16 +++++-
 src/backend/port/posix_sema.c | 100 +++++++++++++++++++++++++++++++++-
 src/include/pg_config.h.in    |   3 +
 4 files changed, 128 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index 6eb25178dab..acbd7cecaac 100755
--- a/configure
+++ b/configure
@@ -17632,6 +17632,10 @@ if test "$ac_res" != no; then :
 fi
 
   fi
+  if test x"$PREFERRED_SEMAPHORES" = x"FUTEX" ; then
+    # Need futex implementation for this
+    USE_FUTEX_SEMAPHORES=1
+  fi
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking which semaphore API to use" >&5
 $as_echo_n "checking which semaphore API to use... " >&6; }
   if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
@@ -17648,11 +17652,19 @@ $as_echo "#define USE_UNNAMED_POSIX_SEMAPHORES 1" >>confdefs.h
       SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
       sematype="unnamed POSIX"
     else
+      if test x"$USE_FUTEX_SEMAPHORES" = x"1" ; then
+
+$as_echo "#define USE_FUTEX_SEMAPHORES 1" >>confdefs.h
+
+        SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+        sematype="futex"
+      else
 
 $as_echo "#define USE_SYSV_SEMAPHORES 1" >>confdefs.h
 
-      SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
-      sematype="System V"
+        SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
+        sematype="System V"
+      fi
     fi
   fi
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: $sematype" >&5
diff --git a/configure.ac b/configure.ac
index 6b4f3e0f2e5..9001c85a74e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2164,6 +2164,10 @@ if test "$PORTNAME" != "win32"; then
     # Need sem_init for this
     AC_SEARCH_LIBS(sem_init, [rt pthread], [USE_UNNAMED_POSIX_SEMAPHORES=1])
   fi
+  if test x"$PREFERRED_SEMAPHORES" = x"FUTEX" ; then
+    # Need futex implementation for this
+    USE_FUTEX_SEMAPHORES=1
+  fi
   AC_MSG_CHECKING([which semaphore API to use])
   if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
     AC_DEFINE(USE_NAMED_POSIX_SEMAPHORES, 1, [Define to select named POSIX semaphores.])
@@ -2175,9 +2179,15 @@ if test "$PORTNAME" != "win32"; then
       SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
       sematype="unnamed POSIX"
     else
-      AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
-      SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
-      sematype="System V"
+      if test x"$USE_FUTEX_SEMAPHORES" = x"1" ; then
+        AC_DEFINE(USE_FUTEX_SEMAPHORES, 1, [Define to select futex semaphores.])
+        SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+        sematype="futex"
+      else
+        AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
+        SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
+        sematype="System V"
+      fi
     fi
   fi
   AC_MSG_RESULT([$sematype])
diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 64186ec0a7e..88feec98d40 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -36,6 +36,10 @@
 #include "storage/pg_sema.h"
 #include "storage/shmem.h"
 
+#if defined(USE_FUTEX_SEMAPHORES)
+#include "port/atomics.h"
+#include "port/pg_futex.h"
+#endif
 
 /* see file header comment */
 #if defined(USE_NAMED_POSIX_SEMAPHORES) && defined(EXEC_BACKEND)
@@ -45,6 +49,9 @@
 typedef union SemTPadded
 {
 	sem_t		pgsem;
+#if defined(USE_FUTEX_SEMAPHORES)
+	pg_atomic_uint32 futexsem;
+#endif
 	char		pad[PG_CACHE_LINE_SIZE];
 } SemTPadded;
 
@@ -70,6 +77,72 @@ static int	nextSemKey;			/* next name to try */
 
 static void ReleaseSemaphores(int status, Datum arg);
 
+#ifdef USE_FUTEX_SEMAPHORES
+
+/*
+ * An implementation of POSIX unnamed semaphores in shared memory, for OSes
+ * that lack them but have futexes.
+ */
+
+/*
+ * Like standard sem_init() with pshared set to 1, meaning that it can work in
+ * shared memory.
+ */
+static void
+pg_futex_sem_init(pg_atomic_uint32 *fut, uint32 value)
+{
+	pg_atomic_init_u32(fut, value);
+}
+
+/*
+ * Like standard sem_post().
+ */
+static int
+pg_futex_sem_post(pg_atomic_uint32 *fut)
+{
+	pg_atomic_fetch_add_u32(fut, 1);
+
+	/*
+	 * XXX If some bits held a waiter count, then the result of the above could
+	 * be checked to see if we can skip this call.  Currently we use semaphores
+	 * as the slow path for lwlocks, so there is always expected to be a
+	 * waiter.
+	 */
+	return pg_futex_wake(fut, INT_MAX);
+}
+
+/*
+ * Like standard sem_wait().
+ */
+static int
+pg_futex_sem_wait(pg_atomic_uint32 *fut)
+{
+	uint32		value = 1;
+
+	/*
+	 * The futex API takes void *, so there is no type checking or casting.
+	 * Assert that pg_atomic_uint32 is really just a wrapped uint32_t as
+	 * required by the kernel for 32 bit futex pre-check.
+	 */
+	StaticAssertStmt(sizeof(*fut) == sizeof(uint32), "unexpected size");
+
+	while (!pg_atomic_compare_exchange_u32(fut, &value, value - 1))
+	{
+		if (value == 0)
+		{
+			/* Wait for someone else to move it above 0. */
+			if (pg_futex_wait_u32(fut, 0, NULL) < 0)
+			{
+				if (errno != EAGAIN)
+					return -1;
+				/* The value changed under our feet.  Try again. */
+			}
+		}
+	}
+	return 0;
+}
+
+#endif
 
 #ifdef USE_NAMED_POSIX_SEMAPHORES
 
@@ -124,7 +197,7 @@ PosixSemaphoreCreate(void)
 
 	return mySem;
 }
-#else							/* !USE_NAMED_POSIX_SEMAPHORES */
+#elif defined(USE_UNNAMED_POSIX_SEMAPHORES)
 
 /*
  * PosixSemaphoreCreate
@@ -139,6 +212,7 @@ PosixSemaphoreCreate(sem_t *sem)
 }
 #endif							/* USE_NAMED_POSIX_SEMAPHORES */
 
+#ifndef USE_FUTEX_SEMAPHORES
 
 /*
  * PosixSemaphoreKill	- removes a semaphore
@@ -156,6 +230,7 @@ PosixSemaphoreKill(sem_t *sem)
 		elog(LOG, "sem_destroy failed: %m");
 #endif
 }
+#endif
 
 
 /*
@@ -238,18 +313,22 @@ PGReserveSemaphores(int maxSemas)
 static void
 ReleaseSemaphores(int status, Datum arg)
 {
+#ifdef USE_NAMED_POSIX_SEMAPHORES
 	int			i;
 
-#ifdef USE_NAMED_POSIX_SEMAPHORES
 	for (i = 0; i < numSems; i++)
 		PosixSemaphoreKill(mySemPointers[i]);
 	free(mySemPointers);
 #endif
 
 #ifdef USE_UNNAMED_POSIX_SEMAPHORES
+	int			i;
+
 	for (i = 0; i < numSems; i++)
 		PosixSemaphoreKill(PG_SEM_REF(sharedSemas + i));
 #endif
+
+	/* Futex-based semaphores have no kernel resource to clean up. */
 }
 
 /*
@@ -261,7 +340,9 @@ PGSemaphore
 PGSemaphoreCreate(void)
 {
 	PGSemaphore sema;
+#ifndef USE_FUTEX_SEMAPHORES
 	sem_t	   *newsem;
+#endif
 
 	/* Can't do this in a backend, because static state is postmaster's */
 	Assert(!IsUnderPostmaster);
@@ -274,6 +355,9 @@ PGSemaphoreCreate(void)
 	/* Remember new sema for ReleaseSemaphores */
 	mySemPointers[numSems] = newsem;
 	sema = (PGSemaphore) newsem;
+#elif defined(USE_FUTEX_SEMAPHORES)
+	sema = &sharedSemas[numSems];
+	pg_futex_sem_init(&sema->sem_padded.futexsem, 1);
 #else
 	sema = &sharedSemas[numSems];
 	newsem = PG_SEM_REF(sema);
@@ -293,6 +377,9 @@ PGSemaphoreCreate(void)
 void
 PGSemaphoreReset(PGSemaphore sema)
 {
+#ifdef USE_FUTEX_SEMAPHORES
+	pg_atomic_write_u32(&sema->sem_padded.futexsem, 0);
+#else
 	/*
 	 * There's no direct API for this in POSIX, so we have to ratchet the
 	 * semaphore down to 0 with repeated trywait's.
@@ -308,6 +395,7 @@ PGSemaphoreReset(PGSemaphore sema)
 			elog(FATAL, "sem_trywait failed: %m");
 		}
 	}
+#endif
 }
 
 /*
@@ -323,7 +411,11 @@ PGSemaphoreLock(PGSemaphore sema)
 	/* See notes in sysv_sema.c's implementation of PGSemaphoreLock. */
 	do
 	{
+#if defined(USE_FUTEX_SEMAPHORES)
+		errStatus = pg_futex_sem_wait(&sema->sem_padded.futexsem);
+#else
 		errStatus = sem_wait(PG_SEM_REF(sema));
+#endif
 	} while (errStatus < 0 && errno == EINTR);
 
 	if (errStatus < 0)
@@ -348,7 +440,11 @@ PGSemaphoreUnlock(PGSemaphore sema)
 	 */
 	do
 	{
+#if defined(USE_FUTEX_SEMAPHORES)
+		errStatus = pg_futex_sem_post(&sema->sem_padded.futexsem);
+#else
 		errStatus = sem_post(PG_SEM_REF(sema));
+#endif
 	} while (errStatus < 0 && errno == EINTR);
 
 	if (errStatus < 0)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 19cbf6e74ee..3fe34e91e1b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 to build with BSD Authentication support. (--with-bsd-auth) */
 #undef USE_BSD_AUTH
 
+/* Define to select futex semaphores. */
+#undef USE_FUTEX_SEMAPHORES
+
 /* Define to build with ICU support. (--with-icu) */
 #undef USE_ICU
 
-- 
2.47.1

Attachment: 0003-Use-futex-based-semaphores-on-macOS.patch (application/x-patch)
From aca7b842282f2f180ff497072079a9c40556f6fc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 26 Oct 2023 18:43:26 +1300
Subject: [PATCH 3/3] Use futex-based semaphores on macOS.

---
 meson.build         |  2 ++
 src/template/darwin | 13 +------------
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/meson.build b/meson.build
index 5c9775f1a6e..0e713f65a66 100644
--- a/meson.build
+++ b/meson.build
@@ -205,6 +205,8 @@ if host_system == 'cygwin'
   mod_link_with_dir = 'libdir'
 
 elif host_system == 'darwin'
+  sema_kind = 'futex'
+
   dlsuffix = '.dylib'
   library_path_var = 'DYLD_LIBRARY_PATH'
 
diff --git a/src/template/darwin b/src/template/darwin
index e8eb9390687..d3c78805401 100644
--- a/src/template/darwin
+++ b/src/template/darwin
@@ -14,17 +14,6 @@ fi
 # Extra CFLAGS for code that will go into a shared library
 CFLAGS_SL=""
 
-# Select appropriate semaphore support.  Darwin 6.0 (macOS 10.2) and up
-# support System V semaphores; before that we have to use named POSIX
-# semaphores, which are less good for our purposes because they eat a
-# file descriptor per backend per max_connection slot.
-case $host_os in
-  darwin[015].*)
-    USE_NAMED_POSIX_SEMAPHORES=1
-    ;;
-  *)
-    USE_SYSV_SEMAPHORES=1
-    ;;
-esac
+USE_FUTEX_SEMAPHORES=1
 
 DLSUFFIX=".dylib"
-- 
2.47.1
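
Editorial sketch, not part of the patch: the heart of pg_futex_sem_wait() in the 0002 hunk above is a "decrement only if positive" compare-and-swap loop. The block below shows that loop in isolation, using C11 <stdatomic.h> instead of PostgreSQL's pg_atomic_* wrappers. The names sketch_sem, sketch_sem_init, sketch_sem_trywait and sketch_sem_post are hypothetical, and the blocking futex call is left out, so the function returns false at the point where the real code would sleep. The sketch never proposes a new value while the observed count is 0. Reading the posted loop, it looks as though a retry after a wakeup could attempt the exchange with an expected value of 0 and a new value of 0 - 1, which would wrap the counter if the value is still 0 at that moment; that is an observation worth checking, not a confirmed defect.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's pg_atomic_uint32-based semaphore. */
typedef struct sketch_sem
{
	_Atomic uint32_t value;
} sketch_sem;

static void
sketch_sem_init(sketch_sem *sem, uint32_t initial)
{
	atomic_init(&sem->value, initial);
}

/*
 * Try to take one unit.  Returns true on success, false if the count was 0
 * (the point where a futex-based implementation would go to sleep).
 */
static bool
sketch_sem_trywait(sketch_sem *sem)
{
	uint32_t	seen = atomic_load(&sem->value);

	while (seen > 0)
	{
		/* On failure, 'seen' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&sem->value, &seen, seen - 1))
			return true;
	}
	return false;
}

/* Add one unit; a futex-based implementation would also wake one waiter. */
static void
sketch_sem_post(sketch_sem *sem)
{
	uint32_t	old = atomic_fetch_add(&sem->value, 1);

	assert(old < UINT32_MAX);	/* guard against counter overflow */
	(void) old;					/* silence unused warning under NDEBUG */
}
```

Because the exchange is only attempted while the observed value is positive, the counter cannot drop below zero no matter how the caller interleaves waits and posts.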

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#7)
Re: Regression tests fail on OpenBSD due to low semmns value

Thomas Munro <thomas.munro@gmail.com> writes:

> Whenever I run into this, or my Mac requires manual ipcrm to clean up
> leaked SysV kernel junk, I rebase my patch for sema_kind = 'futex'.
> Here it goes. It could be updated to support NetBSD I believe, but I
> didn't try as its futex stuff came out later.

FWIW, I looked at a nearby NetBSD 10.0 machine. It has
/usr/include/sys/futex.h, which includes this enticing comment:

/*
 * Definitions for the __futex(2) synchronization primitive.
 *
 * These definitions are intended to be ABI-compatible with the
 * Linux futex(2) system call.
 */

However, the complete lack of any user-level documentation makes
me misdoubt the extent of their commitment to this :-(

I have the same concern about depending on undocumented macOS
APIs. Other than that, getting off of SysV semaphores would be
a nice thing to do.

regards, tom lane

#9Peter Eisentraut
peter@eisentraut.org
In reply to: Andres Freund (#5)
Re: Regression tests fail on OpenBSD due to low semmns value

On 16.12.24 19:19, Andres Freund wrote:

>> * Why in the world is the default value of max_wal_senders 10?
>> I find it hard to believe that there are installations using
>> more than about 3, and even there you can bet they are changing
>> a lot of other parameters.

> I don't think it's that rare as logical replication also needs a walsender
> slot... I think we're going to hurt far more users by lowering this than we'd
> help.

Here is where this change was originally discussed:
/messages/by-id/CABUevEy4PR_EAvZEzsbF5s+V0eEvw7shJ2t-AUwbHOjT+yRb3A@mail.gmail.com

The low semaphore settings on some BSD systems were also mentioned
there. Did anything change now that it is triggering more issues now?

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#9)
Re: Regression tests fail on OpenBSD due to low semmns value

Peter Eisentraut <peter@eisentraut.org> writes:

>>> * Why in the world is the default value of max_wal_senders 10?

> Here is where this change was originally discussed:
> /messages/by-id/CABUevEy4PR_EAvZEzsbF5s+V0eEvw7shJ2t-AUwbHOjT+yRb3A@mail.gmail.com

Hmm. There was not a lot in that thread about which specific nonzero
value of max_wal_senders to use, but I do see

    After some testing and searching for documentation, it seems that at
    least the BSD platforms have a very low default semmns setting
    (apparently 60, which leads to max_connections=30).

> The low semaphore settings on some BSD systems were also mentioned
> there. Did anything change now that it is triggering more issues now?

Yeah, we have more background-process slots reserved by default now.
There's parallel worker slots that were not there in v10, and I think
another one or two random auxiliary processes. So we fail to reach
max_connections=30 now.

As things stand today, we can allocate exactly 20 max_connections
because there are 28 background-process slots if all other parameters
are left at default, and 48 usable semaphores is as many as we
can create under the OpenBSD/NetBSD default of SEMMNS=60. So we're
skating at the hairy edge of whether the parallel regression tests
work reliably, and the next time somebody invents a new kind of
auxiliary process, it will stop working altogether.

My proposal to increase SEMAS_PER_SET to 19 would provide us nine
more usable semaphores under the default *BSD configuration.
With the change to initdb to probe 25 not 20 for max_connections,
five of those would go into max_connections and we'd have four
spares for new background processes. Maybe by the time that runs
out, we'll have found a better alternative to SysV semaphores.

The only downside I can see is that the current setup is able
to coexist with some other service that uses a small number of
SysV semaphores, while with these changes that would not work
without raising the platform SEMMNS limit. Realistically though
you're going to want to raise the platform limit for any sort of
production usage of Postgres. I think this discussion is just
about whether "make; make check" will work out-of-the-box, which
I think is a good goal to have.

regards, tom lane
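
Editorial sketch of the arithmetic in the message above. It assumes, as in sysv_sema.c, that semaphores are allocated in whole sets and that each set costs one extra kernel semaphore beyond its SEMAS_PER_SET usable ones; the pre-change SEMAS_PER_SET value of 16 is likewise an assumption taken from the PostgreSQL source, not from this thread. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Usable semaphores under a system-wide cap of 'semmns', assuming each set
 * costs semas_per_set + 1 kernel semaphores (one extra per set for
 * bookkeeping) and only whole sets are allocated.
 */
static int
usable_semas(int semmns, int semas_per_set)
{
	assert(semmns >= 0 && semas_per_set > 0);
	return (semmns / (semas_per_set + 1)) * semas_per_set;
}

/* Semaphores left over once background slots and connections are covered. */
static int
spare_semas(int usable, int background_slots, int max_connections)
{
	return usable - background_slots - max_connections;
}
```

Under these assumptions, usable_semas(60, 16) is 48 and usable_semas(60, 19) is 57, which accounts for the "nine more usable semaphores". With 28 background-process slots, spare_semas(48, 28, 20) is 0 (the "hairy edge"), while spare_semas(57, 28, 25) is 4, matching the four spares mentioned.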

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: Regression tests fail on OpenBSD due to low semmns value

Hi,

On 2024-12-18 11:23:23 -0500, Tom Lane wrote:

> Peter Eisentraut <peter@eisentraut.org> writes:

>     After some testing and searching for documentation, it seems that at
>     least the BSD platforms have a very low default semmns setting
>     (apparently 60, which leads to max_connections=30).

>> The low semaphore settings on some BSD systems were also mentioned
>> there. Did anything change now that it is triggering more issues now?

> Yeah, we have more background-process slots reserved by default now.
> There's parallel worker slots that were not there in v10, and I think
> another one or two random auxiliary processes. So we fail to reach
> max_connections=30 now.

> As things stand today, we can allocate exactly 20 max_connections
> because there are 28 background-process slots if all other parameters
> are left at default, and 48 usable semaphores is as many as we
> can create under the OpenBSD/NetBSD default of SEMMNS=60. So we're
> skating at the hairy edge of whether the parallel regression tests
> work reliably, and the next time somebody invents a new kind of
> auxiliary process, it will stop working altogether.

> My proposal to increase SEMAS_PER_SET to 19 would provide us nine
> more usable semaphores under the default *BSD configuration.
> With the change to initdb to probe 25 not 20 for max_connections,
> five of those would go into max_connections and we'd have four
> spares for new background processes. Maybe by the time that runs
> out, we'll have found a better alternative to SysV semaphores.

> The only downside I can see is that the current setup is able
> to coexist with some other service that uses a small number of
> SysV semaphores, while with these changes that would not work
> without raising the platform SEMMNS limit. Realistically though
> you're going to want to raise the platform limit for any sort of
> production usage of Postgres. I think this discussion is just
> about whether "make; make check" will work out-of-the-box, which
> I think is a good goal to have.

Maybe we should consider switching those platforms to unnamed posix
semaphores?

There were some not so great performance numbers in the past:
* openbsd, 2021: /messages/by-id/3010886.1634950831@sss.pgh.pa.us
* netbsd, 2022: /messages/by-id/20220828013914.5hzc7kvcpum5h2yn@awork3.anarazel.de

But TBH, nobody uses openbsd and netbsd if performance matters even one
iota. And considering a bunch of postgres changes to deal with idiotic default
sysv limits doesn't feel like a sensible thing to do in 2024.

Greetings,

Andres Freund

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: Regression tests fail on OpenBSD due to low semmns value

Andres Freund <andres@anarazel.de> writes:

> Maybe we should consider switching those platforms to unnamed posix
> semaphores?

I already looked into that. OpenBSD still doesn't have cross-process
posix semaphores, at least according to its man page. NetBSD does,
but they consume an FD per sema, which is actually worse because
the default max-open-files-per-process is none too large either.

> But TBH, nobody uses openbsd and netbsd if performance matters even one
> iota. And considering a bunch of postgres changes to deal with idiotic default
> sysv limits doesn't feel like a sensible thing to do in 2024.

Yeah, I would not expend a lot of effort on this. But two one-line
changes doesn't seem unreasonable.

regards, tom lane

#13Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: Regression tests fail on OpenBSD due to low semmns value

Hi,

On 2024-12-18 12:00:48 -0500, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:

>> Maybe we should consider switching those platforms to unnamed posix
>> semaphores?

> I already looked into that. OpenBSD still doesn't have cross-process
> posix semaphores, at least according to its man page.

Ugh, I had missed that:

    This implementation does not support shared semaphores, and reports this fact
    by setting errno to EPERM. This is perhaps a stretch of the intention of
    POSIX, but is compliant, with the caveat that sem_init() always reports a
    permissions error when an attempt to create a shared semaphore is made.

That's such a stupid argument that I kinda just want to rip openbsd
support out of postgres :)

> NetBSD does, but they consume an FD per sema, which is actually worse
> because the default max-open-files-per-process is none too large either.

Doesn't seem that bad on netbsd 10. Via Bilal's netbsd CI patch, I get:
# sysctl proc.curproc.rlimit.descriptors
proc.curproc.rlimit.descriptors.soft = 1024
proc.curproc.rlimit.descriptors.hard = 3404

>> But TBH, nobody uses openbsd and netbsd if performance matters even one
>> iota. And considering a bunch of postgres changes to deal with idiotic default
>> sysv limits doesn't feel like a sensible thing to do in 2024.

> Yeah, I would not expend a lot of effort on this. But two one-line
> changes doesn't seem unreasonable.

Agreed for stuff like SEMAS_PER_SET. I just don't think it's a good idea to
invest in lowering our default semaphore requirements by lowering various
default process limits or such.

Greetings,

Andres Freund

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#13)
Re: Regression tests fail on OpenBSD due to low semmns value

Andres Freund <andres@anarazel.de> writes:

> On 2024-12-18 12:00:48 -0500, Tom Lane wrote:

>> NetBSD does, but they consume an FD per sema, which is actually worse
>> because the default max-open-files-per-process is none too large either.

> Doesn't seem that bad on netbsd 10. Via Bilal's netbsd CI patch, I get:
> # sysctl proc.curproc.rlimit.descriptors
> proc.curproc.rlimit.descriptors.soft = 1024
> proc.curproc.rlimit.descriptors.hard = 3404

Hmm, on mamba's host I see

proc.curproc.rlimit.descriptors.soft = 128
proc.curproc.rlimit.descriptors.hard = 1772

I had actually tried building with unnamed semas there a couple days
ago, and found that the postmaster failed to start. 21fb39cb0 should
have alleviated that (didn't test it yet). But we're still in a
very limited-resource regime. That with the old performance tests
you dredged up makes me not want to switch sema types.

>> Yeah, I would not expend a lot of effort on this. But two one-line
>> changes doesn't seem unreasonable.

> Agreed for stuff like SEMAS_PER_SET. I just don't think it's a good idea to
> invest in lowering our default semaphore requirements by lowering various
> default process limits or such.

Fair, seems like we're on the same page.

regards, tom lane
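
Editorial sketch of the descriptor budget discussed above. If each semaphore holds one file descriptor, as reported here for NetBSD, the soft descriptor limit bounds how many semaphores a process can have while still leaving room for ordinary files and sockets. The soft limits of 128 and 1024 come from the two sysctl outputs in this thread; the semaphore count of 100 used in the note below is a purely illustrative figure, and both function names are hypothetical. getrlimit(RLIMIT_NOFILE) is the portable way to read what is presumably the same limit that proc.curproc.rlimit.descriptors reports.

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

/*
 * Descriptors left for ordinary files and sockets if each of 'n_semas'
 * semaphores holds one descriptor.  Returns -1 if they do not fit at all.
 */
static long
fds_left_after_semas(long soft_limit, long n_semas)
{
	assert(soft_limit >= 0 && n_semas >= 0);
	if (n_semas > soft_limit)
		return -1;
	return soft_limit - n_semas;
}

/*
 * Read this process's soft RLIMIT_NOFILE.  Returns -1 on error and
 * LONG_MAX if the limit is unlimited or does not fit in a long.
 */
static long
current_soft_fd_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t) LONG_MAX)
		return LONG_MAX;
	return (long) rl.rlim_cur;
}
```

With the illustrative count of 100 semaphores, a soft limit of 128 leaves only 28 descriptors for everything else, whereas a soft limit of 1024 leaves 924, which is one way to see why the two machines lead to different conclusions.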

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#14)
Re: Regression tests fail on OpenBSD due to low semmns value

BTW, I did a little bit of performance testing using current OpenBSD
(7.6), and it looks like they partially fixed the performance issues
I saw with their named POSIX semaphores back in 2021. "pgbench -S"
seems to show TPS rates right about on par with a SysV-sema build.
There is still a measurable hit in connection startup time, about
18.8ms versus 16.7ms according to "pgbench -S -C" (with
max_connections set to 100). But that's probably not something
you'd notice if you weren't looking for it. Postmaster start/stop
time is still awful with max_connections = 10000, but how many
people are likely to try that? (It's a couple of seconds at 1000,
so I detect a strong whiff of an O(N^2) issue in there somewhere.)

So maybe we should think about switching OpenBSD to named semas
by default. One good thing about that is we'd have some buildfarm
coverage for that code path --- right now there are no platforms
that use it.

We'd still want to make the other changes I mentioned for NetBSD's
sake, though.

regards, tom lane

#16Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#15)
Re: Regression tests fail on OpenBSD due to low semmns value

Hello Tom,

16.12.2024 07:23, Tom Lane wrote:

> Alexander Lakhin <exclusion@gmail.com> writes:

>> ...
>> So GetSafeSnapshot() waits indefinitely for possibleUnsafeConflicts to
>> become empty (for other backend to remove itself from the list of possible conflicts
>> inside ReleasePredicateLocks()), but it doesn't happen.

> This seems like an actual bug?

I've reproduced this behavior with two reduced sqls.
prepared_xacts.sql:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  CREATE TABLE pxtest4 (a int);
PREPARE TRANSACTION 'regress_sub2';
\c -
COMMIT PREPARED 'regress_sub2';
-- the script ends prematurely and doesn't reach COMMIT when \c fails due
-- to the "too many clients" error.

transactions.sql:
SELECT pg_sleep(1);
CREATE TABLE writetest (a int);

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
SELECT * FROM writetest; -- ok
COMMIT;

and parallel_schedule:
test: transactions prepared_xacts

So "transactions" backend just waits for the prepared transaction to
finish.

19.12.2024 01:06, Tom Lane wrote:

> We'd still want to make the other changes I mentioned for NetBSD's
> sake, though.

Thank you for fixing that shortcoming!

Best regards,
Alexander

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#16)
Re: Regression tests fail on OpenBSD due to low semmns value

Alexander Lakhin <exclusion@gmail.com> writes:

> I've reproduced this behavior with two reduced sqls.
> prepared_xacts.sql:
> BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
>   CREATE TABLE pxtest4 (a int);
> PREPARE TRANSACTION 'regress_sub2';
> \c -
> COMMIT PREPARED 'regress_sub2';
> -- the script ends prematurely and doesn't reach COMMIT when \c fails due
> -- to the "too many clients" error.

Hmm, okay. Not really a bug, or at least I don't see much we could
do about it.

It does seem odd that a prepared transaction --- which, at least
in theory, we should know won't do anything more --- can block
other serializable transactions. Maybe that could be improved,
but it sounds like a research project not a bug fix.

regards, tom lane