Allow io_combine_limit up to 1MB

Started by Thomas Munro · 11 months ago · 15 messages
#1 Thomas Munro
thomas.munro@gmail.com
1 attachment(s)

Hi,

Tomas queried[1] the limit of 256kB (or really 32 blocks) for
io_combine_limit. Yeah, I think we should increase it and allow
experimentation with larger numbers. Note that real hardware and
protocols have segment and size limits that can force the kernel to
split your I/Os, so it's not at all a given that it'll help much or at
all to use very large sizes, but YMMV. I was originally cautious
because I didn't want to make a few stack buffers too big, but arrays
of BlockNumber, struct iovec, and pointer don't seem too excessive at
say 128 (cf whole blocks on the stack, a thing we do, which would
still be many times larger than the relevant arrays). I was also
anticipating future code that would need to multiply that number by
other terms to allocate shared memory, but after some off-list
discussion, that seems OK: such code should be able to deal with that
using GUCs instead of maximally pessimal allocation. 128 gives a nice
round number of 1M as a maximum transfer size, and comparable systems
seem to have upper limits around that mark. Patch attached.

[1]: /messages/by-id/85696b8e-f1bf-459e-ba97-5608c644c185@vondra.me
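
To put rough numbers on the stack-size argument, here is a
back-of-the-envelope sketch (standalone C, not PostgreSQL code; it
assumes a typical LP64 system and the default 8kB BLCKSZ):

#include <stdio.h>
#include <stdint.h>
#include <sys/uio.h>

typedef uint32_t BlockNumber;	/* stand-in for the PostgreSQL typedef */

int
main(void)
{
	enum { PG_IOV_MAX = 128, BLCKSZ = 8192 };

	/* per-I/O stack arrays at the proposed maximum of 128 blocks */
	printf("struct iovec[128]: %zu bytes\n", sizeof(struct iovec) * PG_IOV_MAX);
	printf("BlockNumber[128]:  %zu bytes\n", sizeof(BlockNumber) * PG_IOV_MAX);
	printf("pointer[128]:      %zu bytes\n", sizeof(void *) * PG_IOV_MAX);

	/* for comparison: one whole block, a thing we already put on the stack */
	printf("one block:         %d bytes\n", BLCKSZ);
	return 0;
}

That prints 2048, 512 and 1024 bytes for the arrays, next to 8192 bytes
for a single block.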

Attachments:

0001-Increase-io_combine_limit-maximum-to-1MB.patch (text/x-patch)
From e8bb23b2130023ec497855c239b0299421b6833d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 21 Jan 2025 11:44:35 +1300
Subject: [PATCH] Increase io_combine_limit maximum to 1MB.

The default value of 128kB is not changed, but the fairly arbitrary
upper limit is changed from 256kB to 1MB (assuming default block size,
and assuming the operating system doesn't limit us to a smaller size).
Whether your system can really do 1MB physical transfers without
splitting them up is another matter, but at least having the option to
try is useful.  Other RDBMSes seem to have limits around that number.

The concrete change is to our definition of PG_IOV_MAX, which in turn
controls MAX_IO_COMBINE_LIMIT.  The latter limits the GUC used for
buffer pool I/O, but the former also affects a couple of other places
that work with arrays of struct iovec or smaller objects on the stack
and always use the compile-time maximum.  Therefore we still don't want
to use IOV_MAX directly, which is not under our control and likely to be
an unnecessarily high 1024.  128 blocks seems acceptable though.

The only current Unix on our target list that is known to have a very
low IOV_MAX is older Solaris, though even it changed recently.  Solaris
11.4 SRU69 added preadv/pwritev, 11.4 SRU72 bumped IOV_MAX from 16 to
1024.  Our Solaris build farm animals don't have either of those updates
yet so are currently testing the fallback code (like Windows).  Note
that its open source cousin illumos already made equivalent changes
years ago.

For Windows, we can't use real scatter/gather yet (though it's possible,
in later work), so we continue to provide an IOV_MAX value of 16 and
emulate preadv()/pwritev() with loops.  It's debatable whether we should
increase that number too: it still benefits from I/O combining sometimes
when buffers happen to be consecutive in memory.  Someone would need to
research that.

This change also makes it theoretically possible for read_stream.c's
internal cap of INT16_MAX to be hit.  With effective_io_concurrency set
to 1000 and io_combine_limit set to 1MB, you could theoretically want to
pin 128K buffers at once (= 1GB of data), but there is a cap at ~32K
buffers that stems from the choice of data type.  That could be
revisited in future, but in practice several other limits are very
likely to kick in first.

Suggested-by: Tomas Vondra <tomas@vondra.me>
---
 doc/src/sgml/config.sgml                      | 2 ++
 src/backend/storage/aio/read_stream.c         | 9 ++++++---
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/port/pg_iovec.h                   | 8 ++++++--
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38244409e3c..5f7d1e41951 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2614,6 +2614,8 @@ include_dir 'conf.d'
        <listitem>
         <para>
          Controls the largest I/O size in operations that combine I/O.
+         The maximum allowed size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..b5a8f3ed941 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -438,9 +438,12 @@ read_stream_begin_impl(int flags,
 	 * Choose the maximum number of buffers we're prepared to pin.  We try to
 	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
 	 * so that we can have a chance to build up a full io_combine_limit sized
-	 * read, even when max_ios is zero.  Be careful not to allow int16 to
-	 * overflow (even though that's not possible with the current GUC range
-	 * limits), allowing also for the spare entry and the overflow space.
+	 * read, even when max_ios is zero.
+	 *
+	 * Be careful not to allow int16 to overflow.  That is possible with the
+	 * current GUC range limits, so this is an artificial limit of ~32k
+	 * buffers and we'd need to adjust the types to exceed that.  We also have
+	 * to allow for the spare entry and the overflow space.
 	 */
 	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c40b7a3121e..bd2fca53612 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,7 +197,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index d9891d3805d..426ef978301 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,12 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack in arrays of
+ * struct iovec and other small types.  The operating system may limit us to a
+ * lower value, though most systems have 1024.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.39.5

#2 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#1)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:

Tomas queried[1] the limit of 256kB (or really 32 blocks) for
io_combine_limit. Yeah, I think we should increase it and allow
experimentation with larger numbers. Note that real hardware and
protocols have segment and size limits that can force the kernel to
split your I/Os, so it's not at all a given that it'll help much or at
all to use very large sizes, but YMMV.

FWIW, I see substantial performance *regressions* with *big* IO sizes using
fio. Just looking at cached buffered IO.

for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done

io size kB    throughput in MB/s
4             6752
8             9297
16            11082
32            14392
64            15967
128           16658
256           16864
512           19114
1024          12874
2048          11770
4096          11781
8192          11744

I.e. throughput peaks at 19GB/s and drops off fairly noticeably after that.

I've measured this on a number of different AMD and Intel systems, with
similar results, albeit with different inflection points. On the Intel
systems I have access to, the point where things slow down seems
typically to be earlier than on AMD.

It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
*vastly* better performance:

io size kB    throughput in MB/s
4             12054
8             13872
16            16709
32            20564
64            22559
128           23133
256           23317
512           25829
1024          15912
2048          15213
4096          14129
8192          13795

Most of the gain isn't due to mitigations=off but to clearcpuid=smap.
Apparently SMAP, which requires explicit code to allow kernel space to
access userspace memory (to make exploitation harder), reacts badly to
copying lots of memory.

This seems absolutely bonkers to me.
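
(For anyone reproducing this, a quick way to confirm that clearcpuid=smap
took effect is to look for the flag in /proc/cpuinfo; the sketch below
assumes, as I believe is the case, that clearcpuid removes it there.)

#include <stdio.h>
#include <string.h>

int
main(void)
{
	char		line[8192];
	FILE	   *f = fopen("/proc/cpuinfo", "r");

	if (f == NULL)
		return 1;
	while (fgets(line, sizeof(line), f) != NULL)
	{
		if (strncmp(line, "flags", 5) == 0)
		{
			/* the flags line is space-separated, so look for " smap" */
			printf("smap is %s\n", strstr(line, " smap") ? "active" : "masked");
			break;
		}
	}
	fclose(f);
	return 0;
}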

I was originally cautious because I didn't want to make a few stack buffers
too big, but arrays of BlockNumber, struct iovec, and pointer don't seem too
excessive at say 128 (cf whole blocks on the stack, a thing we do, which
would still be many times larger than the relevant arrays). I was also
anticipating future code that would need to multiply that number by other
terms to allocate shared memory, but after some off-list discussion, that
seems OK: such code should be able to deal with that using GUCs instead of
maximally pessimal allocation. 128 gives a nice round number of 1M as a
maximum transfer size, and comparable systems seem to have upper limits
around that mark. Patch attached.

To make that possible we'd need two different io_combine_limit GUCs, one
PGC_POSTMASTER that defines a hard max, and one that can be changed at
runtime, up to the PGC_POSTMASTER one.

It's somewhat painful to have such GUCs, because we don't have real
infrastructure for interdependent GUCs. Typically the easiest way is to just
do a Min() at runtime between the two GUCs. But looking at the number of
references to io_combine_limit in read_stream.c, that doesn't look like fun.
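
Spelled out, the Min() approach would mean routing every use site through
something like this hypothetical helper (not something in the tree):

static inline int
effective_io_combine_limit(void)
{
	return Min(io_combine_limit, max_io_combine_limit);
}

and read_stream.c reads io_combine_limit in quite a few places.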

Do you have a good idea how to keep read_stream.c readable?

Greetings,

Andres Freund

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: Allow io_combine_limit up to 1MB

On Wed, Feb 12, 2025 at 1:03 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:

I was also
anticipating future code that would need to multiply that number by other
terms to allocate shared memory, but after some off-list discussion, that
seems OK: such code should be able to deal with that using GUCs instead of
maximally pessimal allocation. 128 gives a nice round number of 1M as a
maximum transfer size, and comparable systems seem to have upper limits
around that mark. Patch attached.

To make that possible we'd need two different io_combine_limit GUCs, one
PGC_POSTMASTER that defines a hard max, and one that can be changed at
runtime, up to the PGC_POSTMASTER one.

It's somewhat painful to have such GUCs, because we don't have real
infrastructure for interdependent GUCs. Typically the easiest way is to just
do a Min() at runtime between the two GUCs. But looking at the number of
references to io_combine_limit in read_stream.c, that doesn't look like fun.

Do you have a good idea how to keep read_stream.c readable?

How about just maintaining it in a new variable
effective_io_combine_limit, whenever either of them is assigned? You
don't get a friendly error message if you set the user-changeable one
too high, but the relationship between them can at least be clearly
documented. Something like this... (I didn't update the name in lots
of comments because it would reflow all the text...).

Attachments:

0001-Introduce-max_io_combine_limit.patch (text/x-patch)
From 6fc1187d92f2960379818fd0c35d0a995f9e7a67 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter, set only at server startup time, applies
as a silent limit on top of that.  The effective I/O combine limit is
maintained in a variable effective_io_combine_limit, and has the
minimum of the two values.

This allows the administrator to cap the total I/O size system-wide, but
also provides a way for proposed patches to know what the maximum could
possibly be in cases where it is multiplied by other GUCs to allocate
shared memory.

The read_stream.c code is changed to respect effective_io_combine_limit.
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++++++-
 src/backend/commands/variable.c               | 18 +++++++++++++++
 src/backend/storage/aio/read_stream.c         | 20 ++++++++--------
 src/backend/storage/buffer/bufmgr.c           |  5 +++-
 src/backend/utils/misc/guc_tables.c           | 16 ++++++++++++-
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  2 ++
 src/include/utils/guc_hooks.h                 |  2 ++
 8 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b557ecabfb..c6de8b9e236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2605,6 +2605,24 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+       <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
        <term><varname>io_combine_limit</varname> (<type>integer</type>)
        <indexterm>
@@ -2613,7 +2631,10 @@ include_dir 'conf.d'
        </term>
        <listitem>
         <para>
-         Controls the largest I/O size in operations that combine I/O.
+         Controls the largest I/O size in operations that combine I/O.  If set
+         higher than the <varname>max_io_combine_limit</varname> parameter, the
+         smaller value will silently be used instead, so both may need to be raised
+         to increase the I/O size.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..62c88b6491a 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,6 +1156,24 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
+/*
+ * GUC assign hooks that recompute effective_io_combine_limit whenever
+ * io_combine_limit and max_io_combine_limit are set.  These are needed because
+ * the GUC subsystem doesn't support dependencies between GUCs, and they may be
+ * assigned in either order.
+ */
+void
+assign_max_io_combine_limit(int newval, void *extra)
+{
+	max_io_combine_limit = newval;
+	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
+}
+void
+assign_io_combine_limit(int newval, void *extra)
+{
+	io_combine_limit = newval;
+	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
+}
 
 /*
  * These show hooks just exist because we want to show the values in octal.
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..b3c17ab9404 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -225,7 +225,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= effective_io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -313,7 +313,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == effective_io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -373,7 +373,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == effective_io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -442,9 +442,9 @@ read_stream_begin_impl(int flags,
 	 * overflow (even though that's not possible with the current GUC range
 	 * limits), allowing also for the spare entry and the overflow space.
 	 */
-	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Max(max_ios * 4, effective_io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - effective_io_combine_limit - 1);
 
 	/* Give the strategy a chance to limit the number of buffers we pin. */
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
@@ -474,14 +474,14 @@ read_stream_begin_impl(int flags,
 	 * io_combine_limit - 1 elements.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + effective_io_combine_limit - 1);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + effective_io_combine_limit - 1]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -522,7 +522,7 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, effective_io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -737,14 +737,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > effective_io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, effective_io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee83669992b..03953ae48c5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,8 +160,11 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 /*
  * Limit on how many blocks should be handled in single I/O operations.
  * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * that call smgr APIs directly.  It is computed as the minimum of underlying
+ * GUCs io_combine_limit and max_io_combine_limit.
  */
+int			effective_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+int			max_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 226af43fe23..8f6966b397f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3216,6 +3216,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"max_io_combine_limit",
+			PGC_POSTMASTER,
+			RESOURCES_IO,
+			gettext_noop("System-wide limit applied to io_combine_limit."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&max_io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, assign_max_io_combine_limit, NULL
+	},
+
 	{
 		{"io_combine_limit",
 			PGC_USERSET,
@@ -3227,7 +3241,7 @@ struct config_int ConfigureNamesInt[] =
 		&io_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
 		1, MAX_IO_COMBINE_LIMIT,
-		NULL, NULL, NULL
+		NULL, assign_io_combine_limit, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d472987ed46..cae4ffbdce7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,6 +197,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..8aa4a2f8fb6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -164,6 +164,8 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int max_io_combine_limit;
+extern PGDLLIMPORT int effective_io_combine_limit;	/* min of above pair */
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 87999218d68..48094641251 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -82,6 +82,8 @@ extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 											 GucSource source);
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_max_io_combine_limit(int newval, void *extra);
+extern void assign_io_combine_limit(int newval, void *extra);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
 extern void assign_max_wal_size(int newval, void *extra);
-- 
2.48.1

#4 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#3)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:

How about just maintaining it in a new variable
effective_io_combine_limit, whenever either of them is assigned?

Yea, that's probably the least bad way.

I wonder if we should just name that variable io_combine_limit and have the
GUC be _raw or _guc or such? There's gonna be a fair number of references to
the variable in code...

Greetings,

Andres Freund

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#4)
1 attachment(s)
Re: Allow io_combine_limit up to 1MB

On Wed, Feb 12, 2025 at 3:22 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:

How about just maintaining it in a new variable
effective_io_combine_limit, whenever either of them is assigned?

Yea, that's probably the least bad way.

I wonder if we should just name that variable io_combine_limit and have the
GUC be _raw or _guc or such? There's gonna be a fair number of references to
the variable in code...

Alternatively, we could compute that as stream->io_combine_limit and
use that. That has the advantage that it's fixed for the life of the
stream, even if you change it (eg between fetches from a CURSOR that
has streams). Pretty sure it won't break anything today, but it might
just run out of queue space limiting concurrency arbitrarily if you
increase it, which is a bit weird now that I focus on that. Capturing
the value we'll use up front seems better on that front. On the other
hand, other future code might also have to remember to compute that
too (write streams, ...), a tiny bit of duplication. Something like
the attached. Or ... I guess we could do both things?

Attachments:

v2-0001-Introduce-max_io_combine_limit.patch (text/x-patch)
From 8cfa23a370a4564a0369991e2b0068b48983a0f6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v2] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter can be set only at server startup time.
Code that combines I/Os should respect both of them by taking the
smaller value.

This allows the administrator to cap the total I/O size system-wide, but
also provides a way for proposed patches to know what the maximum could
possibly be in cases where it is multiplied by other GUCs to allocate
shared memory, without having to assume that it's as high as the
compile-time MAX_IO_COMBINE_LIMIT value.

The read_stream.c code is changed to compute the minimum value up front
as stream->io_combine_limit instead of using io_combine_limit directly.
That has the extra benefit of remaining stable throughout the lifetime
of the stream even if the user changes it (eg while consuming from a
CURSOR).  As previously coded, an mid-stream increase could limit
concurrency artificially just because we run out of queue space too
soon.
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++-
 src/backend/commands/variable.c               |  1 -
 src/backend/storage/aio/read_stream.c         | 29 ++++++++++++-------
 src/backend/storage/buffer/bufmgr.c           |  5 ++--
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  1 +
 6 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b557ecabfb..c6de8b9e236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2605,6 +2605,24 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+       <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
        <term><varname>io_combine_limit</varname> (<type>integer</type>)
        <indexterm>
@@ -2613,7 +2631,10 @@ include_dir 'conf.d'
        </term>
        <listitem>
         <para>
-         Controls the largest I/O size in operations that combine I/O.
+         Controls the largest I/O size in operations that combine I/O.  If set
+         higher than the <varname>max_io_combine_limit</varname> parameter, the
+         smaller value will silently be used instead, so both may need to be raised
+         to increase the I/O size.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..8f77eb6ee79 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,7 +1156,6 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
-
 /*
  * These show hooks just exist because we want to show the values in octal.
  */
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..ad3da5470c4 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -108,6 +108,7 @@ typedef struct InProgressIO
  */
 struct ReadStream
 {
+	int16		io_combine_limit;
 	int16		max_ios;
 	int16		ios_in_progress;
 	int16		queue_size;
@@ -225,7 +226,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -313,7 +314,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -373,7 +374,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -402,6 +403,7 @@ read_stream_begin_impl(int flags,
 					   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
+	int			effective_io_combine_limit;
 	size_t		size;
 	int16		queue_size;
 	int			max_ios;
@@ -409,6 +411,12 @@ read_stream_begin_impl(int flags,
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 
+	/*
+	 * Respect both the system-wide limit and the user-settable limit on I/O
+	 * combining size.
+	 */
+	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
+
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
 	 * currently means advice to the kernel to tell it that we will soon read.
@@ -442,9 +450,9 @@ read_stream_begin_impl(int flags,
 	 * overflow (even though that's not possible with the current GUC range
 	 * limits), allowing also for the spare entry and the overflow space.
 	 */
-	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Max(max_ios * 4, effective_io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - effective_io_combine_limit - 1);
 
 	/* Give the strategy a chance to limit the number of buffers we pin. */
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
@@ -474,14 +482,14 @@ read_stream_begin_impl(int flags,
 	 * io_combine_limit - 1 elements.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + effective_io_combine_limit - 1);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + effective_io_combine_limit - 1]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -508,6 +516,7 @@ read_stream_begin_impl(int flags,
 	if (max_ios == 0)
 		max_ios = 1;
 
+	stream->io_combine_limit = effective_io_combine_limit;
 	stream->max_ios = max_ios;
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
@@ -522,7 +531,7 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -737,14 +746,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee83669992b..3ccee6557d0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,9 +159,10 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
 /*
  * Limit on how many blocks should be handled in single I/O operations.
- * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * StartReadBuffers() callers should respect both of these, as should other
+ * operations that call smgr APIs directly.
  */
+int			max_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d472987ed46..cae4ffbdce7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,6 +197,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..246192b78ad 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -164,6 +164,7 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int max_io_combine_limit;
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
-- 
2.48.1

#6 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#5)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-02-12 15:24:21 +1300, Thomas Munro wrote:

On Wed, Feb 12, 2025 at 3:22 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:

How about just maintaining it in a new variable
effective_io_combine_limit, whenever either of them is assigned?

Yea, that's probably the least bad way.

I wonder if we should just name that variable io_combine_limit and have the
GUC be _raw or _guc or such? There's gonna be a fair number of references to
the variable in code...

Alternatively, we could compute that as stream->io_combine_limit and
use that. That has the advantage that it's fixed for the life of the
stream, even if you change it (eg between fetches from a CURSOR that
has streams). Pretty sure it won't break anything today, but it might
just run out of queue space limiting concurrency arbitrarily if you
increase it, which is a bit weird now that I focus on that. Capturing
the value we'll use up front seems better on that front.

On the other hand, other future code might also have to remember to compute
that too (write streams, ...), a tiny bit of duplication.

Yep, was also "worried" about that.

Or ... I guess we could do both things?

Maybe that'd be the best approach? Not sure.

From 8cfa23a370a4564a0369991e2b0068b48983a0f6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v2] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users. The new
max_io_combine_limit parameter can be set only at server startup time.
Code that combines I/Os should respect both of them by taking the
smaller value.

This allows the administrator to cap the total I/O size system-wide, but
also provides a way for proposed patches to know what the maximum could
possibly be in cases where it is multiplied by other GUCs to allocate
shared memory, without having to assume that it's as high as the
compile-time MAX_IO_COMBINE_LIMIT value.

The read_stream.c code is changed to compute the minimum value up front
as stream->io_combine_limit instead of using io_combine_limit directly.
That has the extra benefit of remaining stable throughout the lifetime
of the stream even if the user changes it (eg while consuming from a
CURSOR). As previously coded, a mid-stream increase could limit
concurrency artificially just because we run out of queue space too
soon.
---
doc/src/sgml/config.sgml | 23 ++++++++++++++-
src/backend/commands/variable.c | 1 -
src/backend/storage/aio/read_stream.c | 29 ++++++++++++-------
src/backend/storage/buffer/bufmgr.c | 5 ++--
src/backend/utils/misc/postgresql.conf.sample | 2 ++
src/include/storage/bufmgr.h | 1 +
6 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b557ecabfb..c6de8b9e236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2605,6 +2605,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+      <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+       <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>

I can see an argument for having the max be slightly higher than the default,
but still less than MAX_IO_COMBINE_LIMIT. But I think just about anything is
fine for now.

--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,7 +1156,6 @@ assign_maintenance_io_concurrency(int newval, void *extra)
#endif
}

-
/*
* These show hooks just exist because we want to show the values in octal.
*/

Bogus hunk?

@@ -402,6 +403,7 @@ read_stream_begin_impl(int flags,
size_t per_buffer_data_size)
{
ReadStream *stream;
+ int effective_io_combine_limit;
size_t size;
int16 queue_size;
int max_ios;
@@ -409,6 +411,12 @@ read_stream_begin_impl(int flags,
uint32 max_pinned_buffers;
Oid tablespace_id;

+	/*
+	 * Respect both the system-wide limit and the user-settable limit on I/O
+	 * combining size.
+	 */
+	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
+
/*
* Decide how many I/Os we will allow to run at the same time.  That
* currently means advice to the kernel to tell it that we will soon read.

Heh, somehow effective_* now gives me hives almost immediately :)

Greetings,

Andres Freund

#7 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#2)
Re: Allow io_combine_limit up to 1MB

On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:

Tomas queried[1] the limit of 256kB (or really 32 blocks) for
io_combine_limit. Yeah, I think we should increase it and allow
experimentation with larger numbers. Note that real hardware and
protocols have segment and size limits that can force the kernel to
split your I/Os, so it's not at all a given that it'll help much or at
all to use very large sizes, but YMMV.

+0.02 to the initiative, I've always been wondering why the I/Os were
so capped, I know :)

FWIW, I see substantial performance *regressions* with *big* IO sizes using
fio. Just looking at cached buffered IO.

for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done

io size kB    throughput in MB/s

[..]

256           16864
512           19114
1024          12874

[..]

It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
*vastly* better performance:

io size kB    throughput in MB/s

[..]

128           23133
256           23317
512           25829
1024          15912

[..]

Most of the gain isn't due to mitigations=off but to clearcpuid=smap.
Apparently SMAP, which requires explicit code to allow kernel space to
access userspace memory (to make exploitation harder), reacts badly to
copying lots of memory.

This seems absolutely bonkers to me.

There are two bizarre things there: a +35% perf boost just like that due
to security drama, and io_size=512kB being so special that it gives a
10-13% boost in your case? Any ideas why? I've run this on an Lsv2
individual MS NVMe under Hyper-V, on ext4, which seems to be a much more
real-world, average-Joe situation; it is much slower, and it shows no
advantage for block sizes beyond, let's say, 128:

io size kB    throughput in MB/s
4             1070
8             1117
16            1231
32            1264
64            1249
128           1313
256           1323
512           1257
1024          1216
2048          1271
4096          1304
8192          1214

Top hitters are of course stuff like clear_page_rep [k] and
rep_movs_alternative [k] (that was with mitigations=on).

-J.

#8 Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#7)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-02-14 09:32:32 +0100, Jakub Wartak wrote:

On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <andres@anarazel.de> wrote:

FWIW, I see substantial performance *regressions* with *big* IO sizes using
fio. Just looking at cached buffered IO.

for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done

io size kB    throughput in MB/s

[..]

256           16864
512           19114
1024          12874

[..]

It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
*vastly* better performance:

io size kB    throughput in MB/s

[..]

128           23133
256           23317
512           25829
1024          15912

[..]

Most of the gain isn't due to mitigations=off but to clearcpuid=smap.
Apparently SMAP, which requires explicit code to allow kernel space to
access userspace memory (to make exploitation harder), reacts badly to
copying lots of memory.

This seems absolutely bonkers to me.

There are two bizarre things there: a +35% perf boost just like that due
to security drama, and io_size=512kB being so special that it gives a
10-13% boost in your case? Any ideas why?

I think there are a few overlapping "cost factors" and that turns out to be
the global minimum:
- syscall overhead: the fewer the better
- memory copy cost: higher for small-ish amounts, then lower
- smap costs: seems to increase with larger amounts of memory
- CPU cache: copying less than L3 cache will be faster, as otherwise memory
bandwidth plays a role

I've run this on an Lsv2
individual MS NVMe under Hyper-V, on ext4, which seems to be a much more
real-world, average-Joe situation; it is much slower, and it shows no
advantage for block sizes beyond, let's say, 128:

io size kB    throughput in MB/s
4             1070
8             1117
16            1231
32            1264
64            1249
128           1313
256           1323
512           1257
1024          1216
2048          1271
4096          1304
8192          1214

Top hitters are of course stuff like clear_page_rep [k] and
rep_movs_alternative [k] (that was with mitigations=on).

I think you're measuring something different than I was. I was purposefully
measuring a fully-cached workload, which worked with that recipe, because I
have more than 32GB of RAM available. But I assume you're running this in a VM
that doesn't have that much, and thus you're actually benchmarking reading data
from disk and - probably more influential in this case - finding buffers to
put the newly read data in.

Greetings,

Andres Freund

#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#6)
3 attachment(s)
Re: Allow io_combine_limit up to 1MB

On Wed, Feb 12, 2025 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Feb 12, 2025 at 3:22 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:

How about just maintaining it in a new variable
effective_io_combine_limit, whenever either of them is assigned?

Yea, that's probably the least bad way.

I wonder if we should just name that variable io_combine_limit and have the
GUC be _raw or _guc or such? There's gonna be a fair number of references to
the variable in code...

Alternatively, we could compute that as stream->io_combine_limit and
use that. That has the advantage that it's fixed for the life of the
stream, even if you change it (eg between fetches from a CURSOR that
has streams). Pretty sure it won't break anything today,

Unfortunately that turned out to be untrue. :-( 0001 is a patch to
address that, which should be back-patched. It's hard to come up with
a repro for an actual problem, but I'm sure it's possible, will try...
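
(To make the failure mode concrete, with hypothetical numbers: a stream
created while io_combine_limit = 16 allocates only 15 overflow entries
past the end of its circular buffer queue, so if the GUC is raised to
128 between FETCHes, a combined read starting near the wrap point could
need up to 127 overflow entries and write past the allocation.)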

0002 and 0003 are new versions of the patches to add
max_io_combine_limit and increase MAX_IO_COMBINE_LIMIT to 1MB. This
time using the name io_combine_limit_guc, so that io_combine_limit
remains as the name referred to in the rest of the tree. I
imagine/hope that we'll one day figure out how to make at least the
easy case (?) of dependencies on PGC_POSTMASTER GUCs work for
PGC_USERSET values, and then io_combine_limit_guc could be deleted...
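
To summarize the arrangement in 0002 (names as in the patch):

  io_combine_limit_guc  - raw value of the user-settable (PGC_USERSET) GUC
  max_io_combine_limit  - the PGC_POSTMASTER cap
  io_combine_limit      - Min() of the two, recomputed by the assign hooks,
                          and the only name the rest of the tree refers to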

Attachments:

v3-0001-Fix-read_stream.c-for-changing-io_combine_limit.patch (text/x-patch)
From a9d09c0140b36cd5acf797ebb998ba7018590888 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Feb 2025 16:52:43 +1300
Subject: [PATCH v3 1/3] Fix read_stream.c for changing io_combine_limit.

In a couple of places, read_stream.c assumed that io_combine_limit would
be stable during the lifetime of a stream.  That is not true in at least
one unusual case: streams held by CURSORs where you could change the GUC
between FETCH commands, with unpredictable results.

Fix, by storing stream->io_combine_limit and referring only to that
after construction.  This mirrors the treatment of the other important
setting {effective,maintenance}_io_concurrency, which is stored in
stream->max_ios.

One of the cases was the queue overflow space, which was sized for
io_combine_limit and could be overrun if the GUC was increased.  Since
that coding was a little hard to follow, also introduce a variable for
better readability instead of open-coding the arithmetic.  Doing so
revealed an off-by-one thinko while clamping max_pinned_buffers to
INT16_MAX, though that wasn't a live bug due to the current limits on
GUC values.

Back-patch to 17.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 38 +++++++++++++++++++--------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 04bdb5e6d4b..36fb9fe152c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -109,6 +109,7 @@ typedef struct InProgressIO
 struct ReadStream
 {
 	int16		max_ios;
+	int16		io_combine_limit;
 	int16		ios_in_progress;
 	int16		queue_size;
 	int16		max_pinned_buffers;
@@ -236,7 +237,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -324,7 +325,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -384,7 +385,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -415,6 +416,7 @@ read_stream_begin_impl(int flags,
 	ReadStream *stream;
 	size_t		size;
 	int16		queue_size;
+	int16		queue_overflow;
 	int			max_ios;
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
@@ -445,6 +447,14 @@ read_stream_begin_impl(int flags,
 	/* Cap to INT16_MAX to avoid overflowing below */
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
+	/*
+	 * If starting a multi-block I/O near the end of the queue, we might
+	 * temporarily need extra space for overflowing buffers before they are
+	 * moved to regular circular position.  This is the maximum extra space we
+	 * could need.
+	 */
+	queue_overflow = io_combine_limit - 1;
+
 	/*
 	 * Choose the maximum number of buffers we're prepared to pin.  We try to
 	 * pin fewer if we can, though.  We add one so that we can make progress
@@ -459,7 +469,7 @@ read_stream_begin_impl(int flags,
 	 */
 	max_pinned_buffers = (max_ios + 1) * io_combine_limit;
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - queue_overflow - 1);
 
 	/* Give the strategy a chance to limit the number of buffers we pin. */
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
@@ -485,18 +495,17 @@ read_stream_begin_impl(int flags,
 	 * one big chunk.  Though we have queue_size buffers, we want to be able
 	 * to assume that all the buffers for a single read are contiguous (i.e.
 	 * don't wrap around halfway through), so we allow temporary overflows of
-	 * up to the maximum possible read size by allocating an extra
-	 * io_combine_limit - 1 elements.
+	 * up to the maximum possible overflow size.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + queue_overflow);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + queue_overflow]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -523,7 +532,14 @@ read_stream_begin_impl(int flags,
 	if (max_ios == 0)
 		max_ios = 1;
 
+	/*
+	 * Capture stable values for these two GUC-derived numbers for the
+	 * lifetime of this stream, so we don't have to worry about the GUCs
+	 * changing underneath us beyond this point.
+	 */
 	stream->max_ios = max_ios;
+	stream->io_combine_limit = io_combine_limit;
+
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
@@ -537,7 +553,7 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -752,14 +768,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
-- 
2.48.1

v3-0002-Introduce-max_io_combine_limit.patch (text/x-patch)
From b7a4bd996a0a296eeda9a3938ff965131825cb9f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v3 2/3] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter can be set only at server startup time,
and applies as a silent limit on top of that.

Since our GUC system doesn't allow dependencies between GUCs, the
user-settable one assigns a "raw" value to io_combine_limit_guc, and the
minimum of io_combine_limit_guc and max_io_combine_limit is maintained
in io_combine_limit.  In other words, code should continue to refer to
io_combine_limit directly, and io_combine_limit_guc is an implementation
detail that could eventually be removed if the GUC system ever learns
how to deal with dependencies.

max_io_combine_limit allows the administrator to cap the total I/O size
system-wide, and also provides a way for proposed AIO patches to know
what the maximum could possibly be when allocating related memory at
server startup, without having to assume it is the compile-time maximum
MAX_IO_COMBINE_LIMIT.  This paves the way for increasing the latter.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++++++-
 src/backend/commands/variable.c               | 18 +++++++++++++++
 src/backend/storage/buffer/bufmgr.c           |  5 +++-
 src/backend/utils/misc/guc_tables.c           | 16 ++++++++++++-
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  4 +++-
 src/include/utils/guc_hooks.h                 |  2 ++
 7 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e55700f35b8..68ba7cc7980 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2631,6 +2631,24 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+       <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
        <term><varname>io_combine_limit</varname> (<type>integer</type>)
        <indexterm>
@@ -2639,7 +2657,10 @@ include_dir 'conf.d'
        </term>
        <listitem>
         <para>
-         Controls the largest I/O size in operations that combine I/O.
+         Controls the largest I/O size in operations that combine I/O.  If set
+         higher than the <varname>max_io_combine_limit</varname> parameter, the
+         smaller value will silently be used instead, so both may need to be raised
+         to increase the I/O size.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..fab8f2c7958 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,6 +1156,24 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
+/*
+ * GUC assign hooks that recompute io_combine_limit whenever
+ * io_combine_limit_guc and max_io_combine_limit are changed.  These are needed
+ * because the GUC subsystem doesn't support dependencies between GUCs, and
+ * they may be assigned in either order.
+ */
+void
+assign_max_io_combine_limit(int newval, void *extra)
+{
+	max_io_combine_limit = newval;
+	io_combine_limit = Min(max_io_combine_limit, io_combine_limit_guc);
+}
+void
+assign_io_combine_limit(int newval, void *extra)
+{
+	io_combine_limit_guc = newval;
+	io_combine_limit = Min(max_io_combine_limit, io_combine_limit_guc);
+}
 
 /*
  * These show hooks just exist because we want to show the values in octal.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7915ed624c1..af97d08b0ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,9 +160,12 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 /*
  * Limit on how many blocks should be handled in single I/O operations.
  * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * that call smgr APIs directly.  It is computed as the minimum of underlying
+ * GUCs io_combine_limit_guc and max_io_combine_limit.
  */
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+int			io_combine_limit_guc = DEFAULT_IO_COMBINE_LIMIT;
+int			max_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..d1cb93562d1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3240,6 +3240,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"max_io_combine_limit",
+			PGC_POSTMASTER,
+			RESOURCES_IO,
+			gettext_noop("System-wide limit applied to io_combine_limit."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&max_io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, assign_max_io_combine_limit, NULL
+	},
+
 	{
 		{"io_combine_limit",
 			PGC_USERSET,
@@ -3251,7 +3265,7 @@ struct config_int ConfigureNamesInt[] =
 		&io_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
 		1, MAX_IO_COMBINE_LIMIT,
-		NULL, NULL, NULL
+		NULL, assign_io_combine_limit, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5362ff80519..1d63f9e5f36 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,6 +200,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..91a2f548c3a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -163,7 +163,9 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
-extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int io_combine_limit;	/* min of the two GUCs below */
+extern PGDLLIMPORT int io_combine_limit_guc;
+extern PGDLLIMPORT int max_io_combine_limit;
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 951451a9765..8f52af2fcee 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -82,6 +82,8 @@ extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 											 GucSource source);
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_max_io_combine_limit(int newval, void *extra);
+extern void assign_io_combine_limit(int newval, void *extra);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
 extern void assign_max_wal_size(int newval, void *extra);
-- 
2.48.1

v3-0003-Increase-io_combine_limit-maximum-to-1MB.patch (text/x-patch)
From ab110f371c3fd690b193c543da6afad96988c70a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 21 Jan 2025 11:44:35 +1300
Subject: [PATCH v3 3/3] Increase io_combine_limit maximum to 1MB.

The default value of 128kB is not changed, but the arbitrarily chosen
upper limit is changed from 256kB to 1MB (assuming default block size,
and assuming the system IOV_MAX doesn't limit us to a smaller size).
Whether your system can really do 1MB physical transfers without
splitting them up is another matter, but having the option to experiment
is useful.  Other RDBMSes seem to have limits around that number.

The concrete change is to our definition of PG_IOV_MAX, which in turn
controls MAX_IO_COMBINE_LIMIT.  The latter limits the GUCs used for
buffer pool I/O, but the former also affects a couple of other places
that work with arrays of struct iovec or smaller objects on the stack
and always use the compile-time maximum.  Therefore we still don't want
to use IOV_MAX directly, which is not under our control and likely to be
an unnecessarily high 1024.  128 blocks seems acceptable though.

The only current Unix on our target list that is known to have a low
IOV_MAX is Solaris before 11.4 SRU72.  Our Solaris build farm animals
don't have that update, or even SRU69 for preadv/pwritev, so are
currently testing the fallback code as used on Windows.  (illumos
already made these changes many years ago.)

For Windows, we can't use real scatter/gather yet (though it's possible,
in later work), so we continue to provide an IOV_MAX value of 16 and
emulate preadv()/pwritev() with loops.  It's debatable whether we should
increase that number too: it still benefits from I/O combining sometimes
when buffers happen to be consecutive in memory.  Someone would need to
research that.

This change also makes it theoretically possible for read_stream.c's
internal cap of INT16_MAX to be hit.  With effective_io_concurrency set
to 1000 and io_combine_limit set to 1MB, you could theoretically want to
pin 128K buffers at once (= 1GB of data), but there is a cap at ~32K
buffers that stems from the choice of data type.  That could be
revisited in future, but in practice several other limits are very
likely to kick in first.

Suggested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 4 ++++
 src/backend/storage/aio/read_stream.c         | 7 ++++---
 src/backend/utils/misc/postgresql.conf.sample | 4 ++--
 src/include/port/pg_iovec.h                   | 8 ++++++--
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 68ba7cc7980..a668ff08ab9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2644,6 +2644,8 @@ include_dir 'conf.d'
          This parameter can only be set in
          the <filename>postgresql.conf</filename> file or on the server
          command line.
+         The maximum possible size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
@@ -2661,6 +2663,8 @@ include_dir 'conf.d'
          higher than the <varname>max_io_combine_limit</varname> parameter, the
          smaller value will silently be used instead, so both may need to be raised
          to increase the I/O size.
+         The maximum possible size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 36fb9fe152c..6939fb8ccf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -463,9 +463,10 @@ read_stream_begin_impl(int flags,
 	 * finishes we don't want to have to wait for its buffers to be consumed
 	 * before starting a new one.
 	 *
-	 * Be careful not to allow int16 to overflow (even though that's not
-	 * possible with the current GUC range limits), allowing also for the
-	 * spare entry and the overflow space.
+	 * Be careful not to allow int16 to overflow.  That is possible with the
+	 * current GUC range limits, so this is an artificial limit of ~32k
+	 * buffers and we'd need to adjust the types to exceed that.  We also have
+	 * to allow for the spare entry and the overflow space.
 	 */
 	max_pinned_buffers = (max_ios + 1) * io_combine_limit;
 	max_pinned_buffers = Min(max_pinned_buffers,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1d63f9e5f36..1859c040b42 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,9 +200,9 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#max_io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 					# (change requires restart)
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 
 # - Worker Processes -
 
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index d9891d3805d..df40c7208be 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,12 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack in arrays of
+ * struct iovec and other small types.  The operating system could limit us to
+ * a number as low as 16, but most systems have 1024.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.48.1

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#9)
3 attachment(s)
Re: Allow io_combine_limit up to 1MB

Here's a new version that also adjusts the code that just landed in da722699:

-       /*
-        * Each IO handle can have an PG_IOV_MAX long iovec.
-        *
-        * XXX: Right now the amount of space available for each IO is PG_IOV_MAX.
-        * While it's tempting to use the io_combine_limit GUC, that's
-        * PGC_USERSET, so we can't allocate shared memory based on that.
-        */
+       /* each IO handle can have up to io_max_combine_limit iovec objects */
        return mul_size(sizeof(struct iovec),
-                                       mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+                                       mul_size(mul_size(io_max_combine_limit, AioProcs()),
                                                         io_max_concurrency));

While doing that, this juxtaposition jumped out at me:

+       uint32          per_backend_iovecs = io_max_concurrency * max_io_combine_limit;

Surely one of these names has to give! First commit wins, so the new
name in this version is "io_max_combine_limit".

I was contemplating making the same change to the MAX_IO_COMBINE_LIMIT
macro when it occurred to me that we could just delete it. It has few
references and they could just use PG_IOV_MAX instead; it's perhaps a
little less clear as names go, but it's also pretty confusing that
there's a macro with the same name as a GUC...

Attachments:

v4-0001-Introduce-io_max_combine_limit.patch (text/x-patch)
From 1200802f3d647a8c9107b2148595dab4b6a1bd82 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v4 1/3] Introduce io_max_combine_limit.

The existing io_combine_limit setting can be changed by users.  The new
io_max_combine_limit setting can be set only at server startup time, and
applies as a silent cap.

This allows aio_init.c to know what the maximum could possibly be when
allocating shared memory at startup, instead of having to assume
PG_IOV_MAX (which a later commit will quadruple).

Since our GUC system doesn't allow dependencies between GUCs, the
user-settable one now assigns a "raw" value to io_combine_limit_guc, and
the lower of io_combine_limit_guc and io_max_combine_limit is maintained
in io_combine_limit.  Most code should continue to refer to
io_combine_limit directly, and io_combine_limit_guc is an implementation
detail that could eventually be removed if the GUC system ever learns
how to deal with dependencies.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++++++-
 src/backend/commands/variable.c               | 18 +++++++++++++++
 src/backend/storage/aio/aio_init.c            | 17 +++++---------
 src/backend/storage/buffer/bufmgr.c           |  5 +++-
 src/backend/utils/misc/guc_tables.c           | 16 ++++++++++++-
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  4 +++-
 src/include/utils/guc_hooks.h                 |  2 ++
 8 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ec18bb7627..d1080dac97f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2625,6 +2625,24 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-max-combine-limit" xreflabel="io_max_combine_limit">
+       <term><varname>io_max_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
        <term><varname>io_combine_limit</varname> (<type>integer</type>)
        <indexterm>
@@ -2633,7 +2651,10 @@ include_dir 'conf.d'
        </term>
        <listitem>
         <para>
-         Controls the largest I/O size in operations that combine I/O.
+         Controls the largest I/O size in operations that combine I/O.  If set
+         higher than the <varname>io_max_combine_limit</varname> parameter, the
+         smaller value will silently be used instead, so both may need to be raised
+         to increase the I/O size.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..f550a3c0c63 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,6 +1156,24 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
+/*
+ * GUC assign hooks that recompute io_combine_limit whenever
+ * io_combine_limit_guc and io_max_combine_limit are changed.  These are needed
+ * because the GUC subsystem doesn't support dependencies between GUCs, and
+ * they may be assigned in either order.
+ */
+void
+assign_io_max_combine_limit(int newval, void *extra)
+{
+	io_max_combine_limit = newval;
+	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
+}
+void
+assign_io_combine_limit(int newval, void *extra)
+{
+	io_combine_limit_guc = newval;
+	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
+}
 
 /*
  * These show hooks just exist because we want to show the values in octal.
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 6fe55510fae..93aba6e03ba 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -18,6 +18,7 @@
 #include "storage/aio.h"
 #include "storage/aio_internal.h"
 #include "storage/aio_subsys.h"
+#include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -66,15 +67,9 @@ AioHandleShmemSize(void)
 static Size
 AioHandleIOVShmemSize(void)
 {
-	/*
-	 * Each IO handle can have an PG_IOV_MAX long iovec.
-	 *
-	 * XXX: Right now the amount of space available for each IO is PG_IOV_MAX.
-	 * While it's tempting to use the io_combine_limit GUC, that's
-	 * PGC_USERSET, so we can't allocate shared memory based on that.
-	 */
+	/* each IO handle can have up to io_max_combine_limit iovec objects */
 	return mul_size(sizeof(struct iovec),
-					mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+					mul_size(mul_size(io_max_combine_limit, AioProcs()),
 							 io_max_concurrency));
 }
 
@@ -83,7 +78,7 @@ AioHandleDataShmemSize(void)
 {
 	/* each buffer referenced by an iovec can have associated data */
 	return mul_size(sizeof(uint64),
-					mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+					mul_size(mul_size(io_max_combine_limit, AioProcs()),
 							 io_max_concurrency));
 }
 
@@ -154,7 +149,7 @@ AioShmemInit(void)
 	bool		found;
 	uint32		io_handle_off = 0;
 	uint32		iovec_off = 0;
-	uint32		per_backend_iovecs = io_max_concurrency * PG_IOV_MAX;
+	uint32		per_backend_iovecs = io_max_concurrency * io_max_combine_limit;
 
 	pgaio_ctl = (PgAioCtl *)
 		ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
@@ -207,7 +202,7 @@ AioShmemInit(void)
 			ConditionVariableInit(&ioh->cv);
 
 			dclist_push_tail(&bs->idle_ios, &ioh->node);
-			iovec_off += PG_IOV_MAX;
+			iovec_off += io_max_combine_limit;
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 79ca9d18d07..d04afa5ab9c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,9 +160,12 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 /*
  * Limit on how many blocks should be handled in single I/O operations.
  * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * that call smgr APIs directly.  It is computed as the minimum of underlying
+ * GUCs io_combine_limit_guc and io_max_combine_limit.
  */
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+int			io_combine_limit_guc = DEFAULT_IO_COMBINE_LIMIT;
+int			io_max_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0d3ebf06a95..720f28a5180 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3241,6 +3241,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_max_combine_limit",
+			PGC_POSTMASTER,
+			RESOURCES_IO,
+			gettext_noop("System-wide limit applied to io_combine_limit."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_max_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, assign_io_max_combine_limit, NULL
+	},
+
 	{
 		{"io_combine_limit",
 			PGC_USERSET,
@@ -3252,7 +3266,7 @@ struct config_int ConfigureNamesInt[] =
 		&io_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
 		1, MAX_IO_COMBINE_LIMIT,
-		NULL, NULL, NULL
+		NULL, assign_io_combine_limit, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 43c2ec2153e..bd9a3507135 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,6 +200,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 16		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_max_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 #io_method = sync			# sync (change requires restart)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 79a89f87fcc..71b3af61b6a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -163,7 +163,9 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
-extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int io_combine_limit;	/* min of the two GUCs below */
+extern PGDLLIMPORT int io_combine_limit_guc;
+extern PGDLLIMPORT int io_max_combine_limit;
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index a3eba8fbe21..0f1e74f96c9 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -86,6 +86,8 @@ extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 											 GucSource source);
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_io_max_combine_limit(int newval, void *extra);
+extern void assign_io_combine_limit(int newval, void *extra);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
 extern void assign_max_wal_size(int newval, void *extra);
-- 
2.39.5

v4-0002-Increase-the-maximum-I-O-combine-size-to-1MB.patch (text/x-patch)
From 4abac6442061744b791031a25a31cd45563cc111 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 21 Jan 2025 11:44:35 +1300
Subject: [PATCH v4 2/3] Increase the maximum I/O combine size to 1MB.

The default value of 128kB is not changed, but the upper limit is
changed from 32 blocks to 128 blocks (1MB with 8kB blocks), assuming the
operating system's IOV_MAX doesn't limit us to a smaller size.  This is
around where some other RDBMSes seem to cap their buffer pool I/O size,
and it seems like a good idea to allow experiments with that.

The concrete change is to our definition of PG_IOV_MAX, which provides
the maximum limit for io_combine_limit and io_max_combine_limit.  It
also affects a couple of other places that work with arrays of struct
iovec or smaller objects on the stack, so we still don't want to use the
system IOV_MAX directly without a clamp: it is not under our control and
likely to be 1024.  128 seems acceptable for all our current use cases.

The last Unix on our target list known to have a low IOV_MAX was Solaris
before 11.4 SRU72 (it was 16, the minimum requirement for POSIX
conformance, but is now 1024, matching all other systems I looked at).

For Windows, we can't use real scatter/gather yet (though it's possible,
in later work), so we continue to define our own IOV_MAX value of 16 and
emulate preadv()/pwritev() with loops there.  Someone would need to
research the trade-off.

This change also makes it possible for read_stream.c's internal cap of
INT16_MAX to be hit, so adjust comments about that.  With
*_io_concurrency and io_combine_limit set to their maximum, it would want
to be able to pin 128K buffers at once (= 1GB of data), but the choice
of data type limits streams to 32K buffers.  That could be revisited in
future, but you'll probably hit other limits long before that one in
your quest to run 1,000 concurrent I/Os of size 1MB.

Suggested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 4 ++++
 src/backend/storage/aio/read_stream.c         | 7 ++++---
 src/backend/utils/misc/postgresql.conf.sample | 4 ++--
 src/include/port/pg_iovec.h                   | 8 ++++++--
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d1080dac97f..93eea7f96d2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2638,6 +2638,8 @@ include_dir 'conf.d'
          This parameter can only be set in
          the <filename>postgresql.conf</filename> file or on the server
          command line.
+         The maximum possible size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
@@ -2655,6 +2657,8 @@ include_dir 'conf.d'
          higher than the <varname>io_max_combine_limit</varname> parameter, the
          smaller value will silently be used instead, so both may need to be raised
          to increase the I/O size.
+         The maximum possible size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index d65fa07b44c..45bdf819d57 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -515,9 +515,10 @@ read_stream_begin_impl(int flags,
 	 * finishes we don't want to have to wait for its buffers to be consumed
 	 * before starting a new one.
 	 *
-	 * Be careful not to allow int16 to overflow (even though that's not
-	 * possible with the current GUC range limits), allowing also for the
-	 * spare entry and the overflow space.
+	 * Be careful not to allow int16 to overflow.  That is possible with the
+	 * current GUC range limits, so this is an artificial limit of ~32k
+	 * buffers and we'd need to adjust the types to exceed that.  We also have
+	 * to allow for the spare entry and the overflow space.
 	 */
 	max_pinned_buffers = (max_ios + 1) * io_combine_limit;
 	max_pinned_buffers = Min(max_pinned_buffers,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd9a3507135..e43d803b278 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,9 +200,9 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 16		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#io_max_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_max_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 					# (change requires restart)
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 
 #io_method = sync			# sync (change requires restart)
 #io_max_concurrency = -1		# Max number of IOs that one process
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index d9891d3805d..df40c7208be 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,12 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack in arrays of
+ * struct iovec and other small types.  The operating system could limit us to
+ * a number as low as 16, but most systems have 1024.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.39.5

v4-0003-Remove-MAX_IO_COMBINE_LIMIT-macro.patch (text/x-patch)
From c33e3b50be396819a73aca805d33f32d88560ace Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 18 Mar 2025 16:12:43 +1300
Subject: [PATCH v4 3/3] Remove MAX_IO_COMBINE_LIMIT macro.

Now that we have a setting io_max_combine_limit, the macro is a bit
confusing.  It only has a few references, and they can be replaced with
PG_IOV_MAX.
---
 src/backend/storage/buffer/bufmgr.c | 6 +++---
 src/backend/utils/misc/guc_tables.c | 4 ++--
 src/include/storage/bufmgr.h        | 3 +--
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d04afa5ab9c..5e4701d6f19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1259,7 +1259,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
 	int			maxcombine = 0;
 
 	Assert(*nblocks > 0);
-	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+	Assert(*nblocks <= PG_IOV_MAX);
 
 	for (int i = 0; i < actual_nblocks; ++i)
 	{
@@ -1452,8 +1452,8 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	for (int i = 0; i < nblocks; ++i)
 	{
 		int			io_buffers_len;
-		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
-		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		Buffer		io_buffers[PG_IOV_MAX];
+		void	   *io_pages[PG_IOV_MAX];
 		instr_time	io_start;
 		BlockNumber io_first_block;
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 720f28a5180..d9540d31c8e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3251,7 +3251,7 @@ struct config_int ConfigureNamesInt[] =
 		},
 		&io_max_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
-		1, MAX_IO_COMBINE_LIMIT,
+		1, PG_IOV_MAX,
 		NULL, assign_io_max_combine_limit, NULL
 	},
 
@@ -3265,7 +3265,7 @@ struct config_int ConfigureNamesInt[] =
 		},
 		&io_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
-		1, MAX_IO_COMBINE_LIMIT,
+		1, PG_IOV_MAX,
 		NULL, assign_io_combine_limit, NULL
 	},
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 71b3af61b6a..0452fe1f1e8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -161,8 +161,7 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+#define DEFAULT_IO_COMBINE_LIMIT Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;	/* min of the two GUCs below */
 extern PGDLLIMPORT int io_combine_limit_guc;
 extern PGDLLIMPORT int io_max_combine_limit;
-- 
2.39.5

#11Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#10)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-03-18 16:18:17 +1300, Thomas Munro wrote:

> Here's a new version that also adjusts the code that just landed in
> da722699:

Something isn't quite right with this code. If I just add -c
io_combine_limit=32 to the options and do a seqscan, I get odd
failures. Mostly assertion failures about buffers not being valid in
CheckReadBuffersOperation().

Sure glad I added CheckReadBuffersOperation(), that'd have been much nastier
to figure out without those assertion failures.

A bit of debugging later: I figured out that this is because io_combine_limit
is bigger than io_max_combine_limit, so the iovecs of one IO overlap with
another. With predictably hilarious results.
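
To visualize the overlap, here is a scaled-down standalone sketch of
aio_init.c's iovec_off arithmetic.  The numbers are invented, and this is
illustration only, not PostgreSQL code:

#include <stdio.h>

/* invented stand-ins for the real settings, scaled down for illustration */
static int	io_max_concurrency = 4;		/* IO handles per backend */
static int	io_max_combine_limit = 8;	/* iovecs reserved per handle */

int
main(void)
{
	int			io_combine_limit = 12;	/* oops: larger than the reservation */

	for (int h = 0, iovec_off = 0; h < io_max_concurrency;
		 h++, iovec_off += io_max_combine_limit)
	{
		int			end = iovec_off + io_combine_limit;

		/* each handle owns a fixed slice, but builds IOs of the GUC size */
		printf("handle %d owns iovecs [%d, %d) but writes [%d, %d)%s\n",
			   h, iovec_off, iovec_off + io_max_combine_limit,
			   iovec_off, end,
			   end > iovec_off + io_max_combine_limit
			   ? "  <- tramples the next handle" : "");
	}
	return 0;
}

Every handle's writes run past its own slice into its neighbor's iovecs once
io_combine_limit exceeds the per-handle reservation.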

I think it might be a small thing:
> Since our GUC system doesn't support dependencies or cross-checks
> between GUCs, the user-settable one now assigns a "raw" value to
> io_combine_limit_guc, and the lower of io_combine_limit_guc and
> io_max_combine_limit is maintained in io_combine_limit.

However, the IO combine limit GUC still references io_combine_limit:

	{
		{"io_combine_limit",
			PGC_USERSET,
			RESOURCES_IO,
			gettext_noop("Limit on the size of data reads and writes."),
			NULL,
			GUC_UNIT_BLOCKS
		},
		&io_combine_limit,
		DEFAULT_IO_COMBINE_LIMIT,
		1, MAX_IO_COMBINE_LIMIT,
		NULL, assign_io_combine_limit, NULL
	},

Therefore the GUC machinery, which stores newval into the registered
variable after calling the assign hook, undoes the clamping of
io_combine_limit done in assign_io_combine_limit:

/*
 * GUC assign hooks that recompute io_combine_limit whenever
 * io_combine_limit_guc and io_max_combine_limit are changed.  These are
 * needed because the GUC subsystem doesn't support dependencies between
 * GUCs, and they may be assigned in either order.
 */
void
assign_io_max_combine_limit(int newval, void *extra)
{
	io_max_combine_limit = newval;
	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
}
void
assign_io_combine_limit(int newval, void *extra)
{
	io_combine_limit_guc = newval;
	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
}

So we end up with a larger io_combine_limit than
io_max_combine_limit. Hilarity ensues.
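
The ordering problem is easy to reproduce outside the server.  Here's a tiny
standalone sketch, where set_int_guc is a made-up stand-in for guc.c's
call-the-hook-then-store-newval behavior and the values 16/32 are arbitrary:

#include <stdio.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

/* stand-ins for the real variables */
static int	io_combine_limit = 16;
static int	io_combine_limit_guc = 16;
static int	io_max_combine_limit = 16;

static void
assign_io_combine_limit(int newval)
{
	io_combine_limit_guc = newval;
	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
}

/* mimics guc.c: run the assign hook, then store newval into the variable */
static void
set_int_guc(int *variable, void (*assign_hook) (int), int newval)
{
	assign_hook(newval);
	*variable = newval;
}

int
main(void)
{
	/* the GUC is registered with &io_combine_limit, so the Min is undone */
	set_int_guc(&io_combine_limit, assign_io_combine_limit, 32);
	printf("io_combine_limit = %d, io_max_combine_limit = %d\n",
		   io_combine_limit, io_max_combine_limit);
	return 0;
}

It prints io_combine_limit = 32 with io_max_combine_limit = 16, exactly the
oversized value that corrupts the iovec slices above.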

Besides the obvious fix, I think we should also add

	Assert(len <= io_max_combine_limit);

to pgaio_io_set_handle_data_32/64; that'd have made the bug much more
obvious.
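
Spelled out, the obvious fix plus that assertion would presumably look
something like this (a hand-written sketch, not the committed patch; the
pgaio_io_set_handle_data_32 signature is my reading of aio.c and may not
match exactly):

--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@
 		},
-		&io_combine_limit,
+		&io_combine_limit_guc,
 		DEFAULT_IO_COMBINE_LIMIT,
 		1, MAX_IO_COMBINE_LIMIT,
 		NULL, assign_io_combine_limit, NULL
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@
 void
 pgaio_io_set_handle_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
 {
+	Assert(len <= io_max_combine_limit);

Registering the raw variable means guc.c's post-hook store can no longer
clobber the clamped io_combine_limit.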

Greetings,

Andres Freund

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: Allow io_combine_limit up to 1MB

Andres Freund <andres@anarazel.de> writes:

> void
> assign_io_max_combine_limit(int newval, void *extra)
> {
> 	io_max_combine_limit = newval;
> 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
> }
> void
> assign_io_combine_limit(int newval, void *extra)
> {
> 	io_combine_limit_guc = newval;
> 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
> }
>
> So we end up with a larger io_combine_limit than
> io_max_combine_limit.  Hilarity ensues.

There's another thing that's rather tin-eared about these assign
hooks: the hook is not supposed to be setting the GUC variable by
itself. guc.c will do that. It's relatively harmless for these
int-valued GUCs to be assigned twice, but I think it might be
problematic if this coding pattern were followed for a string GUC.
I suggest instead

void
assign_io_max_combine_limit(int newval, void *extra)
{
	io_combine_limit = Min(newval, io_combine_limit_guc);
}

void
assign_io_combine_limit(int newval, void *extra)
{
	io_combine_limit = Min(io_max_combine_limit, newval);
}

> Besides the obvious fix, I think we should also add
> 	Assert(len <= io_max_combine_limit);

+1

regards, tom lane

#13Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: Allow io_combine_limit up to 1MB

Hi,

On 2025-04-25 10:26:09 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>> void
>> assign_io_max_combine_limit(int newval, void *extra)
>> {
>> 	io_max_combine_limit = newval;
>> 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
>> }
>> void
>> assign_io_combine_limit(int newval, void *extra)
>> {
>> 	io_combine_limit_guc = newval;
>> 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
>> }
>>
>> So we end up with a larger io_combine_limit than
>> io_max_combine_limit.  Hilarity ensues.
>
> There's another thing that's rather tin-eared about these assign
> hooks: the hook is not supposed to be setting the GUC variable by
> itself.

Agreed. Without that the bug would have been more apparent... Pushed the fix,
alongside the change you outlined.

It's kinda sad not to have any test that exercises a larger
io_combine_limit/io_max_combine_limit - as this bug shows, that'd be
good to have.  However, not all platforms have PG_IOV_MAX > 16, so it
seems like it'd be somewhat painful to test?

Greetings,

Andres Freund

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#13)
Re: Allow io_combine_limit up to 1MB

Andres Freund <andres@anarazel.de> writes:

> It's kinda sad not to have any test that exercises a larger
> io_combine_limit/io_max_combine_limit - as this bug shows, that'd be
> good to have.  However, not all platforms have PG_IOV_MAX > 16, so it
> seems like it'd be somewhat painful to test?

Maybe just skip the test if the maximum value of the GUC isn't
high enough?

regards, tom lane

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#14)
Re: Allow io_combine_limit up to 1MB

On Sat, Apr 26, 2025 at 5:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Andres Freund <andres@anarazel.de> writes:
>> It's kinda sad not to have any test that exercises a larger
>> io_combine_limit/io_max_combine_limit - as this bug shows, that'd be
>> good to have.  However, not all platforms have PG_IOV_MAX > 16, so it
>> seems like it'd be somewhat painful to test?
>
> Maybe just skip the test if the maximum value of the GUC isn't
> high enough?

We could also change IOV_MAX and thus PG_IOV_MAX to (say) 32 on
Windows, if it's useful for testing. It's not real, I just made that
number up when writing pwritev/preadv replacements, and POSIX says
that 16 is the minimum a system should support. I have patches lined
up to add real vectored I/O for Windows, and then the number will
change anyway, probably harmonizing so that it works out to 1MB
everywhere in practice. If it's useful to change it now for a test
then I don't know any reason why not. The idea of the maximally
conservative 16 was not to encourage people to set it to high numbers
while it's emulated, but it's not especially important.
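
For anyone following along at home, the clamp in pg_iovec.h reduces to
something you can poke at standalone.  A sketch: the 16 fallback mirrors the
made-up Windows value, and BLCKSZ is assumed to be the default 8kB:

#include <limits.h>		/* IOV_MAX on POSIX systems */
#include <stdio.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

#ifndef IOV_MAX
#define IOV_MAX 16		/* mirrors the made-up Windows fallback */
#endif

#define PG_IOV_MAX Min(IOV_MAX, 128)	/* the clamp from pg_iovec.h */
#define BLCKSZ 8192						/* assumed default block size */

int
main(void)
{
	printf("IOV_MAX = %d -> PG_IOV_MAX = %d -> max combined I/O = %dkB\n",
		   (int) IOV_MAX, (int) PG_IOV_MAX,
		   (int) (PG_IOV_MAX * BLCKSZ / 1024));
	return 0;
}

With IOV_MAX = 1024 that prints 1024kB, i.e. the 1MB ceiling; with the
emulated 16 it stays at 128kB.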

Unixen have converged on IOV_MAX == 1024, most of them decades ago.  I
think AIX might be a hold-out, but we don't currently care about that,
and Solaris only moved 16->1024 recently.  If we change the fake numbers
made up for Windows, say 16->32, then, if I understood the proposal
correctly, that would leave just a single machine in the 'farm skipping
the test: margay, a Solaris box not receiving OS updates and thus
missing "SRU72".

Sorry for screwing up the GUC, it looks like I completely goofed on
the contract for GUC assign functions! They aren't in charge of
assigning, they're just called *on* assignment. Whoops. And thanks
for fixing it.