postmaster uses more CPU in 18 beta1 with io_method=io_uring

Started by MARK CALLAGHAN · 7 months ago · 20 messages
#1 MARK CALLAGHAN
mdcallag@gmail.com

When measuring the time to create a connection, it is ~2.3X longer with
io_method=io_uring than with io_method=sync (6.9ms vs 3ms), and the
postmaster process uses ~3.5X more CPU to create connections.

The reproduction case so far is my usage of the Insert Benchmark on a large
server with 48 cores. I need to fix the benchmark client -- today it
creates ~1000 connections/s to run a monitoring query in between every 100
queries and the extra latency from connection create makes results worse
for one of the benchmark steps. While I can fix the benchmark client to
avoid this, I am curious about the extra latency in connection create.

I used "perf record -e cycles -F 333 -g -p $pidof_postmaster -- sleep 30"
but I have yet to find a big difference from the reports generated with
that for io_method=io_uring vs =sync. It shows that much time is spent in
the kernel dealing with the VM (page tables, etc).

The server runs Ubuntu 22.04.4. I compiled the Postgres 18beta1 release
from source via:
./configure --prefix=$pfx --enable-debug CFLAGS="-O2
-fno-omit-frame-pointer" --with-lz4 --with-liburing

Output from configure includes:
checking whether to build with liburing support... yes
checking for liburing... yes

io_uring support was installed via "sudo apt install liburing-dev"; the
liburing version is 2.1-2build1
libc is Ubuntu GLIBC 2.35-0ubuntu3.10
gcc is 11.4.0

More performance info is here:
https://mdcallag.github.io/reports/25_06_01.pg.all.mem.hetz/all.html#summary

The config files I used differ only WRT io_method:
* io_method=sync -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10b_c32r128
* io_method=workers -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10cw4_c32r128
* io_method=io_uring -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10d_c32r128

The symptoms are:
* ~20% reduction in point queries/s with io_method=io_uring vs =sync,
=workers, or Postgres 17.4. The issue here is not that SELECT
performance has changed; it is that my benchmark client sometimes creates
connections in between running queries, and the new connection-create
latency with io_method=io_uring hurts throughput
* CPU/query and context switches/query are similar; with io_uring,
CPU/query might be ~4% larger

From sampled thread stacks of the postmaster when I use io_uring the common
stack is:
arch_fork,__GI__Fork,__libc_fork,fork_process,postmaster_child_launch,BackendStartup,ServerLoop,PostmasterMain,main

While the typical stack with io_method=sync is:
epoll_wait,WaitEventSetWaitBlock,WaitEventSetWait,ServerLoop,PostmasterMain,main

I run "ps" during each benchmark step and on example of what I see during a
point query benchmarks step (qp100.L2) with io_method=uring is below. The
benchmark step runs for 300 seconds.
---> from the start of the step
mdcallag 3762684 0.9 1.5 103027276 2031612 ? Ss 03:12 0:14
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg
---> from the end of the step
mdcallag 3762684 15.9 1.5 103027276 2031612 ? Rs 03:12 5:04
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

And from top I see:
---> with =io_uring
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
3762684 mdcallag 20 0 98.3g 1.9g 1.9g R 99.4 1.5 3:04.87
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

--> with =sync
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
2913673 mdcallag 20 0 98.3g 1.9g 1.9g S 28.3 1.5 0:54.13
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

The postmaster had used 0:14 (14 seconds) of CPU time by the start of the
benchmark step and 5:04 (304 seconds) by the end. For the same step with
io_method=sync it was 0:05 at the start and 1:27 at the end. So the
postmaster used ~290 seconds of CPU with =io_uring vs ~82 with =sync, which
is ~3.5X more CPU on the postmaster per connection attempt.

From vmstat what I see is that some of the rates (cs = context switches, us
= user CPU) are ~20% smaller with =io_uring, which is reasonable given that
the throughput is also ~20% smaller. But sy (system CPU) is not 20% smaller,
because of the overhead from all of those calls to fork (or clone).

Avg rates from vmstat:
     cs     us    sy   us+sy
 492961   25.0  14.0   39.0   --> with =sync
 401233   20.1  14.0   34.1   --> with =io_uring

--
Mark Callaghan
mdcallag@gmail.com

#2 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: MARK CALLAGHAN (#1)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

The new overhead for creating connections when io_method=io_uring is also a
function of max_connections. I have been using the default (=100), but when
I increase it to =1000, the time to create a connection almost triples.
That isn't a big surprise given the usage of TotalProcs here:
https://github.com/postgres/postgres/blob/REL_18_BETA1/src/backend/storage/aio/method_io_uring.c#L129
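
To see the mechanism outside of Postgres, here is a small standalone sketch
(my test harness, not anything from the server tree) that creates N io_uring
instances and then times fork()+exit cycles; the per-ring memory mappings are
what the kernel has to copy on fork and tear down on exit. Build with
"cc uring_fork.c -o uring_fork -luring"; you may need to raise "ulimit -n",
since each ring holds a file descriptor.

#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nrings = argc > 1 ? atoi(argv[1]) : 1000;
	struct io_uring *rings = calloc(nrings, sizeof(*rings));
	struct timespec start, end;

	/* each ring adds mmap()ed regions to this process */
	for (int i = 0; i < nrings; i++)
	{
		if (io_uring_queue_init(64, &rings[i], 0) < 0)
		{
			perror("io_uring_queue_init");
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < 100; i++)
	{
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* child exit tears down all inherited VMAs */
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("100 fork+exit cycles with %d rings: %.1f ms\n", nrings,
		   (end.tv_sec - start.tv_sec) * 1e3 +
		   (end.tv_nsec - start.tv_nsec) / 1e6);
	return 0;
}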


--
Mark Callaghan
mdcallag@gmail.com

#3 Andres Freund
andres@anarazel.de
In reply to: MARK CALLAGHAN (#1)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-03 12:24:38 -0700, MARK CALLAGHAN wrote:

When measuring the time to create a connection, it is ~2.3X longer with
io_method=io_uring than with io_method=sync (6.9ms vs 3ms), and the
postmaster process uses ~3.5X more CPU to create connections.

I can reproduce that - the reason for the slowdown is that we create one
io_uring instance for each potential process, and the way we create them
creates one mmap()ed region for each potential process. That creates extra
overhead, particularly when child processes exit.

The reproduction case so far is my usage of the Insert Benchmark on a large
server with 48 cores. I need to fix the benchmark client -- today it
creates ~1000 connections/s to run a monitoring query in between every 100
queries and the extra latency from connection create makes results worse
for one of the benchmark steps.

Heh, yea - 1000 connections/sec will influence performance regardless of this issue.

While I can fix the benchmark client to avoid this, I am curious about the
extra latency in connection create.

I used "perf record -e cycles -F 333 -g -p $pidof_postmaster -- sleep 30"
but I have yet to find a big difference from the reports generated with
that for io_method=io_uring vs =sync. It shows that much time is spent in
the kernel dealing with the VM (page tables, etc).

I see a lot of additional time spent below
do_group_exit->do_exit->...->unmap_vmas
which fits the theory that this is due to the number of memory mappings.

There has been a bunch of discussion around this on Mastodon, particularly
below [1], where Jens pointed out that we should use
https://man7.org/linux/man-pages/man3/io_uring_queue_init_mem.3.html to avoid
creating this many memory mappings; that thread ended with Jens prototyping
the approach [2].

There are a few complications around that though - only newer kernels (>=6.5)
support the caller providing the memory for the mapping and there isn't yet a
good way to figure out how much memory needs to be provided.
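
For reference, the probe can be distilled to a few lines (a sketch, assuming
liburing >= 2.5 for io_uring_queue_init_mem() and headers that define
IORING_SETUP_NO_MMAP; this mirrors what the man page above suggests):

#include <liburing.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = {0};
	size_t len = 1024 * 1024;	/* deliberate over-estimate */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (buf == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* returns bytes consumed from buf on success, -errno on failure */
	ret = io_uring_queue_init_mem(64, &ring, &p, buf, len);
	if (ret >= 0)
	{
		printf("supported, one ring needs %d bytes\n", ret);
		io_uring_queue_exit(&ring);
	}
	else
		printf("unsupported (%d), fall back to io_uring_queue_init()\n", ret);

	munmap(buf, len);
	return 0;
}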

I think this is a big enough pitfall that it's worth fixing in 18, obviously
assuming the patch has sensible complexity. RMT, anyone, what do you think?

Greetings,

Andres Freund

[1]: https://fosstodon.org/@axboe/114630982449670090
[2]: https://pastebin.com/7M3C8aFH

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Andres Freund <andres@anarazel.de> writes:

I think this is a big enough pitfall that it's worth fixing in 18, obviously
assuming the patch has sensible complexity. RMT, anyone, what do you think?

Let's see the patch ... but yeah, I'd rather not ship 18 like this.

regards, tom lane

#5 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
1 attachment(s)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-05 12:47:52 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think this is a big enough pitfall that it's worth fixing in 18, obviously
assuming the patch has sensible complexity. RMT, anyone, what do you think?

Let's see the patch ... but yeah, I'd rather not ship 18 like this.

I've attached a first draft.

I can't make heads or tails of the ordering in configure.ac, so the function
test is probably in the wrong place.

Greetings,

Andres

Attachments:

v1-0001-wip-aio-Combine-io_uring-memory-mappings-if-suppo.patch (text/x-diff; charset=us-ascii)
From aa740a18f2addffadc75defed05777f75dde0a6a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 5 Jun 2025 14:10:33 -0400
Subject: [PATCH v1] wip: aio: Combine io_uring memory mappings, if supported

Author:
Reviewed-by:
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
Backpatch:
---
 meson.build                               |   6 +
 configure.ac                              |   7 +
 src/include/pg_config.h.in                |   3 +
 src/backend/storage/aio/method_io_uring.c | 205 +++++++++++++++++++++-
 configure                                 |  17 ++
 src/tools/pgindent/typedefs.list          |   1 +
 6 files changed, 234 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index d142e3e408b..bc731362fb1 100644
--- a/meson.build
+++ b/meson.build
@@ -990,6 +990,12 @@ liburingopt = get_option('liburing')
 liburing = dependency('liburing', required: liburingopt)
 if liburing.found()
   cdata.set('USE_LIBURING', 1)
+
+  if cc.has_function('io_uring_queue_init_mem',
+      dependencies: liburing, args: test_c_args)
+    cdata.set('HAVE_LIBURING_QUEUE_INIT_MEM', 1)
+  endif
+
 endif
 
 
diff --git a/configure.ac b/configure.ac
index 4b8335dc613..14f485a453f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1420,6 +1420,13 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_liburing" = yes; then
+  _LIBS="$LIBS"
+  LIBS="$LIBURING_LIBS $LIBS"
+  AC_CHECK_FUNCS([io_uring_queue_init_mem])
+  LIBS="$_LIBS"
+fi
+
 if test "$with_lz4" = yes ; then
   AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
 fi
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 726a7c1be1f..c4dc5d72bdb 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -229,6 +229,9 @@
 /* Define to 1 if you have the global variable 'int timezone'. */
 #undef HAVE_INT_TIMEZONE
 
+/* Define to 1 if you have the `io_uring_queue_init_mem' function. */
+#undef HAVE_IO_URING_QUEUE_INIT_MEM
+
 /* Define to 1 if __builtin_constant_p(x) implies "i"(x) acceptance. */
 #undef HAVE_I_CONSTRAINT__BUILTIN_CONSTANT_P
 
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index cc312b641ca..bc7f0104d98 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -29,6 +29,7 @@
 
 #ifdef IOMETHOD_IO_URING_ENABLED
 
+#include <sys/mman.h>
 #include <liburing.h>
 
 #include "miscadmin.h"
@@ -94,12 +95,32 @@ PgAioUringContext
 	struct io_uring io_uring_ring;
 } PgAioUringContext;
 
+/*
+ * Information about the capabilities that io_uring has.
+ *
+ * Depending on liburing and kernel version different features are
+ * supported. At least for the kernel a kernel version check does not suffice
+ * as various vendors do backport features to older kernels :(.
+ */
+typedef struct PgAioUringCaps
+{
+	bool		checked;
+	/* -1 if io_uring_queue_init_mem() is unsupported */
+	int			mem_init_size;
+} PgAioUringCaps;
+
+
 /* PgAioUringContexts for all backends */
 static PgAioUringContext *pgaio_uring_contexts;
 
 /* the current backend's context */
 static PgAioUringContext *pgaio_my_uring_context;
 
+static PgAioUringCaps pgaio_uring_caps =
+{
+	.checked = false,
+	.mem_init_size = -1,
+};
 
 static uint32
 pgaio_uring_procs(void)
@@ -111,16 +132,144 @@ pgaio_uring_procs(void)
 	return MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
 }
 
+/*
+ * Initializes pgaio_uring_caps unless that's already done
+ */
+static void
+pgaio_uring_check_capabilities(void)
+{
+	if (pgaio_uring_caps.checked)
+		return;
+
+	/*
+	 * By default io_uring creates a shared memory mapping for io_uring
+	 * instance, leading to a large number of memory mappings. Unfortunately a
+	 * large number of memory mappings slows things down, backend exit is
+	 * particularly affected.  To address that newer kernels (6.5) support
+	 * using user-provided memory for the memory, by putting the relevant
+	 * memory into shared memory we don't need any additional mappings.
+	 *
+	 * To know whether this is supported we unfortunately need to probe the
+	 * kernel by trying to create a ring with userspace-provided memory. This
+	 * also has a secondary benefit: We can check precisely how much memory we
+	 * need for each io_uring instance.
+	 */
+#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1
+	{
+		struct io_uring test_ring;
+		size_t		ring_size;
+		void	   *ring_ptr;
+		struct io_uring_params p = {0};
+		int			ret;
+
+		/*
+		 * Liburing does not yet provide an API to query how much memory a
+		 * ring will need. So we over-estimate it here. As the memory is freed
+		 * just below that's small temporary waste of memory.
+		 *
+		 * 1MB is more than enough for rings within io_max_concurrency's
+		 * range.
+		 */
+		ring_size = 1024 * 1024;
+
+		/*
+		 * Hard to believe a system exists where 1MB would not be a multiple
+		 * of the page size. But it's cheap to ensure...
+		 */
+		ring_size -= ring_size % sysconf(_SC_PAGESIZE);
+
+		ring_ptr = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+		if (ring_ptr == MAP_FAILED)
+			elog(ERROR,
+				 "mmap(%zu) to determine io_uring_queue_init_mem() support has failed: %m",
+				 ring_size);
+
+		ret = io_uring_queue_init_mem(io_max_concurrency, &test_ring, &p, ring_ptr, ring_size);
+		if (ret > 0)
+		{
+			pgaio_uring_caps.mem_init_size = ret;
+			/* FIXME: This should probably not stay at DEBUG1? */
+			elog(DEBUG1,
+				 "can use combined memory mapping for io_uring, each ring needs %d bytes",
+				 ret);
+
+			/* clean up the created ring, it was just for a test */
+			io_uring_queue_exit(&test_ring);
+		}
+		else
+		{
+			/*
+			 * There are different reasons for ring creation to fail, but it's
+			 * ok to treat that just as io_uring_queue_init_mem() not being
+			 * supported. We'll report a more detailed error in
+			 * pgaio_uring_shmem_init().
+			 */
+			errno = -ret;
+			elog(DEBUG1,
+				 "cannot use combined memory mapping for io_uring, ring creation failed with %m");
+
+		}
+
+		if (munmap(ring_ptr, ring_size) != 0)
+			elog(ERROR, "munmap() failed: %m");
+	}
+#else
+	{
+		elog(DEBUG1,
+			 "can't use combined memory mapping for io_uring, kernel or liburing too old");
+	}
+#endif
+
+	pgaio_uring_caps.checked = true;
+}
+
+/*
+ * Memory for all PgAioUringContext instances
+ */
 static Size
 pgaio_uring_context_shmem_size(void)
 {
 	return mul_size(pgaio_uring_procs(), sizeof(PgAioUringContext));
 }
 
+/*
+ * Memory for the combined memory used by io_uring instances. Returns 0 if
+ * that is not supported by kernel/liburing.
+ */
+static Size
+pgaio_uring_ring_shmem_size(void)
+{
+	size_t		sz = 0;
+
+	if (pgaio_uring_caps.mem_init_size > 0)
+	{
+		/*
+		 * Memory for rings needs to be allocated to the page boundary,
+		 * reserve space. Luckily it does not need to be aligned to hugepage
+		 * boundaries, even if huge pages are used.
+		 */
+		sz = add_size(sz, sysconf(_SC_PAGESIZE));
+		sz = add_size(sz, mul_size(pgaio_uring_procs(), pgaio_uring_caps.mem_init_size));
+	}
+
+	return sz;
+}
+
 static size_t
 pgaio_uring_shmem_size(void)
 {
-	return pgaio_uring_context_shmem_size();
+	size_t		sz;
+
+	/*
+	 * Kernel and liburing support for various features influences how much
+	 * shmem we need, perform the necessary checks.
+	 */
+	pgaio_uring_check_capabilities();
+
+	sz = pgaio_uring_context_shmem_size();
+	sz = add_size(sz, pgaio_uring_ring_shmem_size());
+
+	return sz;
 }
 
 static void
@@ -128,13 +277,38 @@ pgaio_uring_shmem_init(bool first_time)
 {
 	int			TotalProcs = pgaio_uring_procs();
 	bool		found;
+	char	   *shmem;
+	size_t		ring_mem_remain = 0;
+	char	   *ring_mem_next = 0;
 
-	pgaio_uring_contexts = (PgAioUringContext *)
-		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
-
+	/*
+	 * XXX: We allocate memory for all PgAioUringContext instances and, if
+	 * supported, the memory required for each of the io_uring instances, in
+	 * one ShmemInitStruct(). Should we instead separate the latter into a
+	 * separate ShmemInitStruct()?
+	 */
+	shmem = ShmemInitStruct("AioUringContext", pgaio_uring_shmem_size(), &found);
 	if (found)
 		return;
 
+	pgaio_uring_contexts = (PgAioUringContext *) shmem;
+	shmem += pgaio_uring_context_shmem_size();
+
+	if (pgaio_uring_caps.mem_init_size > 0)
+	{
+		ring_mem_remain = pgaio_uring_ring_shmem_size();
+		ring_mem_next = (char *) shmem;
+
+		/* align to page boundary, see also pgaio_uring_ring_shmem_size() */
+		ring_mem_next = (char *) TYPEALIGN(sysconf(_SC_PAGESIZE), ring_mem_next);
+
+		/* account for alignment */
+		ring_mem_remain -= ring_mem_next - shmem;
+		shmem += ring_mem_next - shmem;
+
+		shmem += ring_mem_remain;
+	}
+
 	for (int contextno = 0; contextno < TotalProcs; contextno++)
 	{
 		PgAioUringContext *context = &pgaio_uring_contexts[contextno];
@@ -158,7 +332,28 @@ pgaio_uring_shmem_init(bool first_time)
 		 * be worth using that - also need to evaluate if that causes
 		 * noticeable additional contention?
 		 */
-		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+
+		/*
+		 * If supported (c.f. pgaio_uring_check_capabilities()), create ring
+		 * with its data in shared memory. Otherwise fall back io_uring
+		 * creating a memory mapping for each ring.
+		 */
+#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP)
+		if (pgaio_uring_caps.mem_init_size > 0)
+		{
+			struct io_uring_params p = {0};
+
+			ret = io_uring_queue_init_mem(io_max_concurrency, &context->io_uring_ring, &p, ring_mem_next, ring_mem_remain);
+
+			ring_mem_remain -= ret;
+			ring_mem_next += ret;
+		}
+		else
+#endif
+		{
+			ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+		}
+
 		if (ret < 0)
 		{
 			char	   *hint = NULL;
diff --git a/configure b/configure
index 4f15347cc95..5ae8edee079 100755
--- a/configure
+++ b/configure
@@ -13309,6 +13309,23 @@ fi
 
 fi
 
+if test "$with_liburing" = yes; then
+  _LIBS="$LIBS"
+  LIBS="$LIBURING_LIBS $LIBS"
+  for ac_func in io_uring_queue_init_mem
+do :
+  ac_fn_c_check_func "$LINENO" "io_uring_queue_init_mem" "ac_cv_func_io_uring_queue_init_mem"
+if test "x$ac_cv_func_io_uring_queue_init_mem" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_IO_URING_QUEUE_INIT_MEM 1
+_ACEOF
+
+fi
+done
+
+  LIBS="$_LIBS"
+fi
+
 if test "$with_lz4" = yes ; then
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
 $as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8346cda633..83299ccc427 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2171,6 +2171,7 @@ PgAioReturn
 PgAioTargetData
 PgAioTargetID
 PgAioTargetInfo
+PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgArchData
-- 
2.48.1.76.g4e746b1a31.dirty

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#4)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

On Thu, Jun 05, 2025 at 12:47:52PM -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think this is a big enough pitfall that it's worth fixing in 18, obviously
assuming the patch has sensible complexity. RMT, anyone, what do you think?

Let's see the patch ... but yeah, I'd rather not ship 18 like this.

+1, I see no point in waiting for v19, especially since all of this stuff
is new in v18, anyway.

--
nathan

#7 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-05 14:32:10 -0400, Andres Freund wrote:

I've attached a first draft.

I can't make heads or tails of the ordering in configure.ac, so the function
test is probably in the wrong place.

Any comments on that patch? I'd hoped for some review comments... Unless I
hear otherwise, I'll just do a bit more polish and push..

Greetings,

Andres

#8 Jim Nasby
jnasby@upgrade.com
In reply to: Andres Freund (#7)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

+#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

Is that && 1 intentional?

Nit:
+ "mmap(%zu) to determine io_uring_queue_init_mem() support has failed: %m",
IMHO that would read better without "has".

+ /* FIXME: This should probably not stay at DEBUG1? */
+ elog(DEBUG1,
+ "can use combined memory mapping for io_uring, each ring needs %d bytes",
+ ret);

Assuming my read that this is only executed at postmaster start is correct,
I agree that NOTICE would also be reasonable. Though I'm not sure what a
user could actually do with the info...

+ elog(DEBUG1,
+ "can't use combined memory mapping for io_uring, kernel or liburing too old");

OTOH this message would definitely be of interest to users; I'd say it
should at least be NOTICE, possibly even WARNING. It'd also be good to have
a HINT either explaining the downside or pointing to the docs.

+ * Memory for rings needs to be allocated to the page boundary,
+ * reserve space. Luckily it does not need to be aligned to hugepage
+ * boundaries, even if huge pages are used.

Is "reserve space" left over from something else?
AFAICT pgaio_uring_ring_shmem_size() isn't even reserving space...


#9 Burd, Greg
greg@burd.me
In reply to: Andres Freund (#7)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

On Jun 30, 2025, at 12:27 PM, Andres Freund <andres@anarazel.de> wrote:


Any comments on that patch? I'd hoped for some review comments... Unless I
hear otherwise, I'll just do a bit more polish and push..

Thanks for doing this work!

I just read through the v1 patch and it looks good. I have just a few small nit-picky questions:

+ #if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

The '1' looks like cruft, or am I missing something?

+ /* FIXME: This should probably not stay at DEBUG1? */

Worth fixing before pushing?

Also, this returns 'Size' but uses 'size_t' inside the function; I assume that's intentional?

+ static Size
+ pgaio_uring_ring_shmem_size(void)

The next, similar, function below this one returns 'size_t'.

Finally (and this may be me missing a convention everyone else knows):

+ * XXX: We allocate memory for all PgAioUringContext instances and, if

Is there any reason to keep the 'XXX'? You ask yourself a question in that comment, do you know the answer or was that a request to reviewers for feedback? :)

I hope that is helpful.

-greg


#10 Andres Freund
andres@anarazel.de
In reply to: Burd, Greg (#9)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-30 15:31:14 -0400, Burd, Greg wrote:


Thanks for doing this work!

I just read through the v1 patch and it looks good. I have just a few small nit-picky questions:

+ #if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

The '1' looks like cruft, or am I missing something?

It's for making it easy to test both paths when running on a kernel/liburing
combo that's new enough to have support.
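
I.e., a local testing hack, not something to commit: flipping the trailing
constant to 0 compiles out the new path, so the fallback can be exercised
even on a new kernel:

#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 0
	/* the io_uring_queue_init_mem() branch is now compiled out */
#endif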

+ /* FIXME: This should probably not stay at DEBUG1? */

Worth fixing before pushing?

Yes. I was just not yet sure what it should be. I ended up concluding that
it's probably fine to just keep it at DEBUG1...

Also, this returns 'Size' but in the function uses 'size_t' I assume that's intentional?

+ static Size
+ pgaio_uring_ring_shmem_size(void)

The next, similar, function below this one returns 'size_t'.

You're right - I wish we would just do a (slightly smarter) version of
s/Size/size_t/...

Finally, and this may be me missing something everyone else knows is convention.

+ * XXX: We allocate memory for all PgAioUringContext instances and, if

Is there any reason to keep the 'XXX'? You ask yourself a question in that
comment, do you know the answer or was that a request to reviewers for
feedback? :)

A bit of both :). I concluded that it's not worth having a separate segment,
there's not enough memory here to matter...

I hope that is helpful.

Yep!

Greetings,

Andres Freund

#11 Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#8)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-30 13:57:28 -0500, Jim Nasby wrote:

+#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

Is that && 1 intentional?

It was for testing both branches...

Nit:
+ "mmap(%zu) to determine io_uring_queue_init_mem() support has failed: %m",
IMHO that would read better without "has".

Agreed, fixed.

+ /* FIXME: This should probably not stay at DEBUG1? */
+ elog(DEBUG1,
+ "can use combined memory mapping for io_uring, each ring needs %d bytes",
+ ret);
Assuming my read that this is only executed at postmaster start is correct,
I agree that NOTICE would also be reasonable. Though I'm not sure what a
user could actually do with the info...

I was thinking of *lowering* it, given that the user, as you point out, can't
do much with the information.

+ elog(DEBUG1,
+ "can't use combined memory mapping for io_uring, kernel or liburing too old");
OTOH this message would definitely be of interest to users; I'd say it
should at least be NOTICE, possibly even WARNING.

I don't think it's worth it - typically the user won't be able to do much,
given that just upgrading the kernel is rarely easily possible.

It'd also be good to have a HINT either explaining the downside or pointing
to the docs.

I don't know about that - outside of extreme cases the performance effects
really aren't that meaningful. E.g. compiling with openssl support also has
connection establishment performance overhead, yet we don't document that
anywhere either, even though it's present even with ssl=off.

+ * Memory for rings needs to be allocated to the page boundary,
+ * reserve space. Luckily it does not need to be aligned to hugepage
+ * boundaries, even if huge pages are used.
Is "reserve space" left over from something else?

No, it's trying to say that this is reserving space for alignment.

AFAICT pgaio_uring_ring_shmem_size() isn't even reserving space...

That's all it does? It's used for sizing the shared memory allocation...

Greetings,

Andres Freund

#12 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#7)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-06-30 12:27:10 -0400, Andres Freund wrote:

Any comments on that patch? I'd hoped for some review comments... Unless I
hear otherwise, I'll just do a bit more polish and push..

After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
to increase the log level as Jim suggested, but if we end up deciding that
that's the way to go, we can easily change that...

Greetings,

Andres

#13 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#12)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

On Tue, Jul 8, 2025 at 5:22 AM Andres Freund <andres@anarazel.de> wrote:

After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
to increase the log level as Jim suggested, but if we end up deciding that
that's the way to go, we can easily change that...

Hi Andres,

I'm with Jim, as I've just hit it too, though not on exit() but on fork(), so:

1. Could we s/DEBUG1/INFO/ that debug message level? (For those two
"cannot use combined memory mapping for io_uring" messages, and maybe
add "potentially slow new connections" there too along the way?)
2. Maybe we could add some wording to the docs about io_method, that it
might cause such trouble?

Just wasted an hour wondering why $stuff is slow, given:
max_connections = '20000' # yes, yay..
io_method = 'io_uring'

I was getting slow fork()/clone() performance when there were lots of
io_uring fds/instances in the main postmaster:
$ /usr/pgsql19/bin/pgbench -f select1.sql -c 1000 -j 1 -t 1 -P 1
[..]
progress: 39.7 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 40.6 s, 1039.9 tps, lat 407.696 ms stddev 291.856, 0 failed
[..]
initial connection time = 39632.164 ms
tps = 1015.608893 (without initial connection time)

So yes, ~40s just to connect to the database. I was using some old
branch from before June (without f54af9f2679d5987b46), so more or less
simulating <= 6.5 as you say. I was limited to 20-30 forks()/sec
according to bpftrace. The problem goes away with the default io_method
(~800 forks()/sec). With max_connections = 2k, I got 5s initial
connection times. It looked like it was caused by io_uring, as with
io_uring fork() was slow somewhere in vma_interval_tree_insert_after
<- copy_process <- kernel_clone <- __do_sys_clone <- do_syscall_64
(?). I've tested it on 6.14.17, but also on LTS 6.1.x (well, the
difference is that it takes 65s instead of 40s...). Then I searched
and hit this thread. But 6.1 is the LTS kernel, so plenty of people
are going to hit those regressions with the io_uring io_method, won't
they?

I can try to prepare a patch, please just let me know.

-J.

#14 Robert Treat
rob@xzilla.net
In reply to: Jakub Wartak (#13)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Did anything ever happen with this? I do think it would be helpful to
make some of these pot-holes more user-visible / discoverable. I have
a suspicion that we're going to see people using pre-built packages
with io_uring support installed onto older kernels they are still
hanging on to because pg_upgrade was the easiest path, but who could
either update the kernel or upgrade via logical replication to get the
new functionality if they knew about it.

Robert Treat
https://xzilla.net

#15 Andres Freund
andres@anarazel.de
In reply to: Robert Treat (#14)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

On 2025-09-06 09:12:19 -0400, Robert Treat wrote:

On Tue, Aug 26, 2025 at 9:32 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

[..]

Then I searched and hit this thread, but 6.1 is the LTS kernel, so plenty of
people are going to hit those regressions with the io_uring io_method, won't
they?

I doubt it, but who knows.

I can try to prepare a patch, please just let me know.

Yes, please do.

Did anything ever happen with this?

No. I missed the email. So thanks for the reminder.

I do think it would be helpful to make some of these pot-holes more
user-visible / discoverable.

I have a suspicion that we're going to see people using pre-built packages
with io_uring support installed onto older kernels they are still hanging
on to because pg_upgrade was the easiest path, but who could either
update the kernel or upgrade via logical replication to get the new
functionality if they knew about it.

If they just upgrade in-place, they won't use io_uring. And they won't simply
use io_uring with this large max_connections without also tuning the file
descriptor limits...

Greetings,

Andres Freund

#16 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#15)
1 attachment(s)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi Andres / Robert,

On Mon, Sep 8, 2025 at 5:55 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-09-06 09:12:19 -0400, Robert Treat wrote:

[..]

[..], but 6.1 is the LTS kernel, so plenty of people
are going to hit those regressions with the io_uring io_method, won't
they?

I doubt it, but who knows.

RHEL 8.x won't have it (an RH KB article [1] says: "RHEL 8.x: The addition
to RHEL8 was being tracked in private Bug 1881561 - Add io_uring support.
Unfortunately, it has been decided that io_uring support will not be
enabled in RHEL8.").

RHEL 9.x seems to be all based on 5.14.x (so well below 6.5.x) and
states that io_uring is in Tech Preview there and is disabled, but it can
be enabled via sysctl. Hard to tell what they will backpatch into
5.14.x there. So if anywhere, I would speculate it would be RHEL9 (?),
therefore 5.14.x (+ their custom backpatches).

I can try to prepare a patch, please just let me know.

Yes, please do.

Attached.

If they just upgrade in-place, they won't use io_uring. And they won't simply
use io_uring with this large max_connections without also tuning the file
descriptor limits...

Business as usual, just another obstacle...

-J.

Attachments:

v1-0001-aio-warn-user-if-combined-io_uring-memory-mapping.patch (application/octet-stream)
From ad7c856e964b614507a06342c2acbf10bfa4855c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Tue, 9 Sep 2025 14:30:48 +0200
Subject: [PATCH v1] aio: warn user if combined io_uring memory mappings are
 unavailable

In f54af9f2 we added a solution to avoid the connection and disconnection hit
caused by io_uring managing a large number of memory mappings. Unfortunately
it is only available on more modern Linux kernels (6.5+), therefore notify the
user in a visible way if this optimization is not available.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by:
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
---
 doc/src/sgml/config.sgml                  |  6 ++++++
 src/backend/storage/aio/method_io_uring.c | 14 ++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2a3685f474a..9d541999dc1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2784,6 +2784,12 @@ include_dir 'conf.d'
         <para>
          This parameter can only be set at server start.
         </para>
+        <para>
+         Note that for optimum performance with <literal>io_uring</literal>
+         Linux kernel version >= 6.5 is recommended, as it provides a way to
+         reduce the number of additional memory mappings which may negatively
+         affect the efficiency of establishing and terminating connections.
+        </para>
        </listitem>
       </varlistentry>
 
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index bb06da63a8e..5cd839df2f3 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -207,8 +207,11 @@ pgaio_uring_check_capabilities(void)
 			 * pgaio_uring_shmem_init().
 			 */
 			errno = -ret;
-			elog(DEBUG1,
-				 "cannot use combined memory mapping for io_uring, ring creation failed: %m");
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("cannot use combined memory mapping for io_uring, ring creation failed: %m"),
+					 errdetail("Connection and disconnection rates and efficiency may be degraded."),
+					 errhint("Ensure that you are running kernel >= 6.5")));
 
 		}
 
@@ -217,8 +220,11 @@ pgaio_uring_check_capabilities(void)
 	}
 #else
 	{
-		elog(DEBUG1,
-			 "can't use combined memory mapping for io_uring, kernel or liburing too old");
+		ereport(WARNING,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot use combined memory mapping for io_uring, kernel or liburing too old"),
+				 errdetail("Connection and disconnection rates and efficiency may be degraded."),
+				 errhint("Ensure that you are running kernel >= 6.5")));
 	}
 #endif
 
-- 
2.39.5

#17 Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#16)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi,

From ad7c856e964b614507a06342c2acbf10bfa4855c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Tue, 9 Sep 2025 14:30:48 +0200
Subject: [PATCH v1] aio: warn user if combined io_uring memory mappings are
unavailable

In f54af9f2 we have added solution to avoid connection and disconnection hit
caused by io_uring managing large number of memory mappings. Unfortunately
it is available only on more modern Linux kernels (6.5) therefore notify user
in visible way if this optimization is not available.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by:
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
---
doc/src/sgml/config.sgml | 6 ++++++
src/backend/storage/aio/method_io_uring.c | 14 ++++++++++----
2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2a3685f474a..9d541999dc1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2784,6 +2784,12 @@ include_dir 'conf.d'
<para>
This parameter can only be set at server start.
</para>
+        <para>
+         Note that for optimum performance with <literal>io_uring</literal>
+         Linux kernel version >= 6.5 is recommended, as it provides a way to
+         reduce the number of additional memory mappings which may negatively
+         affect the efficiency of establishing and terminating connections.
+        </para>
</listitem>
</varlistentry>

This seems too low-level for end user docs, while not explaining that the
impact is due to a high max_connections value, rather than a large number of
actually established connections. How about something like

Note that for optimal performance with <literal>io_uring</literal> Linux
kernel version >= 6.5 is recommended. Older Linux versions, high values
of <xref linkend="guc-max-connections"/> will slow down connection
establishment and termination.

diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index bb06da63a8e..5cd839df2f3 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -207,8 +207,11 @@ pgaio_uring_check_capabilities(void)
* pgaio_uring_shmem_init().
*/
errno = -ret;
-			elog(DEBUG1,
-				 "cannot use combined memory mapping for io_uring, ring creation failed: %m");
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("cannot use combined memory mapping for io_uring, ring creation failed: %m"),
+					 errdetail("Connection and disconnection rates and efficiency may be degraded."),
+					 errhint("Ensure that you are running kernel >= 6.5")));

To me this seems too verbose, particularly because the majority of users
encountering it have zero chance to address the issue. And it's not like most
real-world workloads are particularly affected; if you run with
max_connections=20k and have 100 connections/second, you'll have a *lot* of
other problems.

Here's the full log of a start with the fallback branch forced:

2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] WARNING: cannot use combined memory mapping for io_uring, ring creation failed: Unknown error -8192
2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] DETAIL: Connection and disconnection rates and efficiency may be degraded.
2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] HINT: Ensure that you are running kernel >= 6.5
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-15.2.0, 64-bit
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG: listening on IPv6 address "::1", port 5440
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG: listening on IPv4 address "127.0.0.1", port 5440
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG: listening on Unix socket "/tmp/.s.PGSQL.5440"
2025-09-21 12:20:49.712 EDT [4090831][startup][:0][] LOG: database system was shut down at 2025-09-21 12:20:42 EDT
2025-09-21 12:20:49.717 EDT [4090828][postmaster][:0][] LOG: database system is ready to accept connections

Close to half the lines are the new warning.

Greetings,

Andres Freund

#18 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#17)
1 attachment(s)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Hi Andres,

On Sun, Sep 21, 2025 at 6:29 PM Andres Freund <andres@anarazel.de> wrote:
[..]

This seems too low-level for end user docs, while not explaining that the
impact is due to a high max_connections value, rather than a large number of
actually established connections. How about something like

Note that for optimal performance with <literal>io_uring</literal> Linux
kernel version >= 6.5 is recommended. Older Linux versions, high values
of <xref linkend="guc-max-connections"/> will slow down connection
establishment and termination.

Agreed, attached v2. Just one nitpick -- wouldn't '>> On << older
Linux versions ' sound better there?

[..v1 patch]

To me this seems too verbose, particularly because the majority of users
encountering it have zero chance to address the issue. And it's not like most
real-world workloads are particularly affected; if you run with
max_connections=20k and have 100 connections/second, you'll have a *lot* of
other problems.

Here's the full log of a start with the fallback branch forced:

[..]

Close to half the lines are the new warning.

I see two paths forward:

1. either we make it shorter, but I do not know if a multi-sentence
error message is against some project policy? Feel free to readjust as
necessary; I'm not strongly attached to the exact wording, just to
hinting people.
2. maybe we could emit the warning only under certain criteria, like
if (max_connections > 1000) for example (a minimal sketch follows this
list). However, Mark (OP) reported the slowdown even for the value of
100, so it seems we should warn about it always? (And it deteriorated
3x for him @ 1000 max_connections.) So it's like opening a new can of
worms (establishing a proper threshold).
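
For illustration, a minimal sketch of option 2 (the threshold and the idea
of a variable elevel are purely hypothetical, not part of the attached
patch; MaxConnections is the C-level variable behind max_connections):

	/* hypothetical: only shout when there are many per-backend rings */
	int			elevel = (MaxConnections > 1000) ? WARNING : DEBUG1;

	ereport(elevel,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("cannot use combined memory mapping for io_uring, ring creation failed: %m")));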

Anyway attached v2 generates:

2025-09-22 09:56:21.123 CEST [12144] WARNING: io_uring combined
memory mapping creation failed: Unknown error -8192. Upgrade kernel to
6.5+ for improved performance
2025-09-22 09:56:21.179 CEST [12144] LOG: starting PostgreSQL 19devel
on x86_64-linux, compiled by clang-16.0.6, 64-bit
2025-09-22 09:56:21.180 CEST [12144] LOG: listening on IPv6 address
"::1", port 1236
2025-09-22 09:56:21.180 CEST [12144] LOG: listening on IPv4 address
"127.0.0.1", port 1236
2025-09-22 09:56:21.185 CEST [12144] LOG: listening on Unix socket
"/tmp/.s.PGSQL.1236"
2025-09-22 09:56:21.197 CEST [12147] LOG: database system was shut
down at 2025-09-22 09:55:44 CEST
2025-09-22 09:56:21.207 CEST [12144] LOG: database system is ready to
accept connections

BTW: on RHEL/derivatives it was possible to push people in certain
critical conditions into using kernel-lt/kernel-ml (but those are from
EPEL repos), so it's not that they have no room to maneuver.

-J.

Attachments:

v2-0001-aio-warn-user-if-combined-io_uring-memory-mapping.patch (application/x-patch)
From 72cd5da67c5c60d150ac4780acf8c3d81323b810 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Mon, 22 Sep 2025 10:00:47 +0200
Subject: [PATCH v2] aio: warn user if combined io_uring memory mappings are
 unavailable

In f54af9f2 we added a solution to avoid the connection and disconnection hit
caused by io_uring managing a large number of memory mappings. Unfortunately
it is only available on more modern Linux kernels (6.5+), therefore notify the
user in a visible way if this optimization is not available.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
---
 doc/src/sgml/config.sgml                  |  6 ++++++
 src/backend/storage/aio/method_io_uring.c | 10 ++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e9b420f3ddb..15dd955a0c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2784,6 +2784,12 @@ include_dir 'conf.d'
         <para>
          This parameter can only be set at server start.
         </para>
+        <para>
+         Note that for optimum performance with <literal>io_uring</literal>
+         Linux kernel version >= 6.5 is recommended. Older Linux versions,
+         high values of <xref linkend="guc-max-connections"/> will slow down connection
+         establishment and termination.
+        </para>
        </listitem>
       </varlistentry>
 
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index bb06da63a8e..36b9fabf7c5 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -207,8 +207,9 @@ pgaio_uring_check_capabilities(void)
 			 * pgaio_uring_shmem_init().
 			 */
 			errno = -ret;
-			elog(DEBUG1,
-				 "cannot use combined memory mapping for io_uring, ring creation failed: %m");
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("io_uring combined memory mapping creation failed: %m. Upgrade kernel to 6.5+ for improved performance")));
 
 		}
 
@@ -217,8 +218,9 @@ pgaio_uring_check_capabilities(void)
 	}
 #else
 	{
-		elog(DEBUG1,
-			 "can't use combined memory mapping for io_uring, kernel or liburing too old");
+		ereport(WARNING,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("io_uring combined memory mapping creation failed: %m. Upgrade kernel to 6.5+ for improved performance")));
 	}
 #endif
 
-- 
2.39.5

#19 Tomas Vondra
tomas@vondra.me
In reply to: Jakub Wartak (#18)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

On 9/22/25 10:45, Jakub Wartak wrote:

...

The v2 patch got no response for 1+ month, it seems. I see it adds info
to two places - sgml docs and elog().

I'm skeptical about the elog() changes. Maybe the log level change would
be good? But as Andres pointed out, the people seeing this may not have a
chance to address the issue.

I don't think we should add references to a particular kernel version in
our log messages. We'd need to make sure it does not get stale, stuff
may get backpatched (even if it's unlikely in this case), and so on.
There are likely plenty of other places where the behavior (or performance)
depends on the kernel version - so why would this particular case be
special? I just don't see this as very helpful.

I'm not sure about adding the exact kernel version to the docs either.
There are exactly two references to a particular kernel version (and
that's pg_combinebackup/pg_upgrade referencing 4.5). There's an
implicit understanding that newer kernel versions are faster, especially
for recently introduced features (like io_uring).

It'd probably make sense to have a section about io_uring tuning in
general. Maybe it should mention even the kernel version, not sure. But
it could mention other related stuff, like the need to increase the file
descriptor limit, etc.

regards

--
Tomas Vondra

#20 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#19)
Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

On Fri, Nov 7, 2025 at 11:50 PM Tomas Vondra <tomas@vondra.me> wrote:

On 9/22/25 10:45, Jakub Wartak wrote:

The v2 patch got no response for 1+ month, it seems. I see it adds info
to two places - sgml docs and elog().

I'm skeptical about the elog() changes. Maybe the log level change would
be good? But as Andres pointed out, the people seeing this may not have a
chance to address the issue.

Hi Tomas,

You both expressed the same concern, that users "may not have a chance to
address the issue". How's that? Users can disable io_uring, update the OS,
the kernel, and so on.

If I were a DBA in the field, I would want to get *any* sort of warning
that connection times are going to be impacted after setting
io_method=io_uring (WARNING or INFO, doesn't matter, certainly not DEBUG).
The IOPS increase probably does not justify the impact on apps with high
values of max_connections (app connection latency is also often a top
concern). Even the OP (Mark) wrote "but when I increase it to =1000
then the time to create a connection almost triples" (and people use
much bigger values).

We can remove the kernel version wording, sure, no problem. BTW, I'm also a
user, and I have spotted the consequences of io_uring_queue_init_mem(3) not
working on three separate occasions (sic! I have a short memory), only
realizing after a while that it was about io_uring, because I had set up
the cluster some time ago and didn't connect the dots...

today it is:
elog(DEBUG1, "cannot use combined memory mapping for io_uring, ring
creation failed: %m");
elog(DEBUG1, "can't use combined memory mapping for io_uring, kernel
or liburing too old");

So maybe let's just go back to basics and increase DEBUG1 ->
INFO/NOTICE/WARNING (whatever), and that's all here?
-J.