AIO / read stream heuristics adjustments for index prefetching
Hi,
The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead. It seemed better to discuss them separately than
in that already very large thread.
The first two patches are also a dependency of the explain read stream
patches [2].
There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:
1) Prefetching for index scans is much more sensitive to overly aggressive
read-ahead, due to plans that run index scans to partial completion
rather than to full completion. Consider e.g. nestloop antijoins
or such, where the scan on the inner side will be started but often not
completed. If we read ahead too aggressively, a lot of IO
could be wasted.
While it's of course possible to have partially consumed read streams with
sequential scans or bitmap heap scans, it's not as common / cost sensitive.
For seqscans it likely mostly happens with a LIMIT above the seqscan, but
that probably won't be happening many times within a query on a table of
any size.
For bitmap heap scans it's not as common because the startup cost, i.e.
building the bitmap, is far from cheap; doing that over and over does
not make a lot of sense.
2) Prefetching for index scans is much more likely to have complicated mixes
of hits and misses.
Whereas a seqscan or a bitmap heap scan accesses each table block exactly
once, with index scans it's very common to have repeated accesses to some
table blocks, while still having misses on other blocks. This means that
index scans are more sensitive to patterns of hits and misses decreasing
the readahead distance so much that we don't do aggressive enough readahead
to avoid waiting for IO anymore.
While more pronounced with index prefetching, it was already an issue with
the existing users, particularly for bitmap heap scans. In fact, a similar
patch to what's included here was first discussed somewhere around the BHS
prefetching work.
There's a few sets of changes here:
0001+0002: Return whether WaitReadBuffers() needed to wait
The first patch allows pgaio_wref_check_done() to work more reliably with
io_uring. Until now it only was able to return true if userspace already
had consumed the kernel's completion event, but returned false otherwise.
That's not really incorrect, just suboptimal.
The second patch returns whether WaitReadBuffers() needed to wait for
IO. This is useful for a) instrumentation like in [2] and b) to provide
information to the read_stream heuristics to control how aggressively to
perform read-ahead.
0003: read_stream: Issue IO synchronously while in fast path
When read stream is in fast path mode (where it short-circuits the read
ahead logic, to reduce CPU overhead in s_b resident workloads) and
encounters a miss, until now we performed the read asynchronously.
Unfortunately, with io_method=worker, that can lead to slowdowns, because
dispatching to workers has a latency impact. When doing "real" readahead,
that's a price worth paying, because the latency should be hidden by
issuing the reads early enough. But when just coming out of fast path
mode, we're not ahead of what's needed, so the dispatch latency can't be
hidden.
We already have infrastructure to mark IOs to be executed
synchronously. So we just need to use that here.
0004: read_stream: Prevent distance from decaying too quickly
This quite simple patch reduces issue 2) from above, by preventing the
look-ahead distance from being reduced for as many blocks as the maximum
look-ahead distance after each miss. While this may seem overly aggressive, a single
effectively synchronous read can take a long time compared to the CPU time
needed for processing page hits. On cloud storage the IO latency is
somewhere between 0.5ms and 4ms. A halfway modern CPU can do a few
heap_hot_search_buffer()s on 1000s of pages within 1 ms.
While this one is my patch, several others have written variations of it
before. We should probably have committed one already.
There are two minor questions here:
- Should read_stream_pause()/read_stream_resume() restore the "holdoff"
counter? I doubt it matters for the prospective user, since it will
only be used when the lookahead distance is very large.
- For how long to hold off distance reductions? Initially I was torn
between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
think the former makes more sense, as we otherwise won't allow for far
enough readahead when doing IO combining, and it does seem to make sense
to hold off decay for long enough that the maximum lookahead could not
theoretically allow us to start an IO.
0005+0006: Only increase distance when waiting for IO
Until now we have increased the read-ahead distance whenever we
needed to do IO (doubling the distance on every miss). But that will often be
way too aggressive, with the IO subsystem being able to keep up with a
much lower distance.
The idea here is to use information about whether we needed to wait for IO
before returning the buffer in read_stream_next_buffer() to control
whether we should increase the readahead distance.
This seems to work extremely well for worker.
Unfortunately with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is in
the kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.
So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.
As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now the distance is used to
control both. See 0008 for an attempt at splitting those concerns.
0007: Make read_stream_reset()/end() not wait for IO
This is a quite experimental, not really correct as-is, patch to avoid
unnecessarily waiting for in-flight IO when read_stream_reset() is done
while there's in-flight IO. This is useful for things like nestloop
antijoins with quals on the inner side (without the qual we'd not trigger
any readahead, as that's deferred in the index prefetching patch).
As-is this will leave IOs visible in pg_aios for a while, potentially
until the backends exit. That's not right.
0008: WIP: read stream: Split decision about look ahead for AIO and combining
Until now read stream has used a single look-ahead distance to control
lookahead for both IO combining and read-ahead. That's sub-optimal, as we
want to do IO combining even when we don't need to do any readahead, as
avoiding the syscall overhead is important to reduce CPU overhead when
data is in the kernel page cache.
This is a prototype for what it could look like to split those
decisions. Thereby fixing the regression mentioned in 0006.
One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.
I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
the per-buffer-data mechanism. That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough in
some situations.
Greetings,
Andres Freund
[1]: /messages/by-id/CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com
[2]: /messages/by-id/6f541abf-f9e1-4830-93cc-4a849dbf2ecf@vondra.me
Attachments:
v3-0001-aio-io_uring-Allow-IO-methods-to-check-if-IO-comp.patch (+73/-1)
v3-0002-bufmgr-Return-whether-WaitReadBuffers-needed-to-w.patch (+17/-3)
v3-0003-read_stream-Issue-IO-synchronously-while-in-fast-.patch (+21/-1)
v3-0004-read_stream-Prevent-distance-from-decaying-too-qu.patch (+33/-4)
v3-0005-aio-io_uring-Trigger-async-processing-for-large-I.patch (+68/-23)
v3-0006-read_stream-Only-increase-distance-when-waiting-f.patch (+34/-8)
v3-0007-Hacky-implementation-of-making-read_stream_reset-.patch (+54/-7)
v3-0008-WIP-read-stream-Split-decision-about-look-ahead-f.patch (+195/-48)
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0001+0002: Return whether WaitReadBuffers() needed to wait
The first patch allows pgaio_wref_check_done() to work more reliably with
io_uring. Until now it only was able to return true if userspace already
had consumed the kernel's completion event, but returned false otherwise.
That's not really incorrect, just suboptimal.
The second patch returns whether WaitReadBuffers() needed to wait for
IO. This is useful for a) instrumentation like in [2] and b) to provide
information to the read_stream heuristics to control how aggressive to
perform read ahead.
These both look good to me except in 0001 you left an XXX in
pgaio_wref_check_done() that I think is the very thing that commit
does.
0003: read_stream: Issue IO synchronously while in fast path
LGTM
0004: read_stream: Prevent distance from decaying too quickly
There are two minor questions here:
- Should read_stream_pause()/read_stream_resume() restore the "holdoff"
counter? I doubt it matters for the prospective user, since it will
only be used when the lookahead distance is very large.
I don't really understand this. We have to do this with distance
because we set it to 0 and use distance == 0 to indicate stream end.
read_stream_pause() doesn't set the distance_decay_holdoff to 0. If you
mean, should we reset holdoff to its initial value, then I don't think
so. I imagine that users doing a lot of pause and resume may not have
high distance.
- For how long to hold off distance reductions? Initially I was torn
between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
think the former makes more sense, as we otherwise won't allow for far
enough readahead when doing IO combining, and it does seem to make sense
to hold off decay for long enough that the maximum lookahead could not
theoretically allow us to start an IO.
I agree. 0004 LGTM otherwise.
- Melanie
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0005+0006: Only increase distance when waiting for IO
In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.
Is there any reason for both the io size and inflight IOs threshold to
be 4? If they should be the same, I think it would be better if this
was a macro.
This may not matter, but the old code checked in_flight_before > 5
before incrementing if for the current IO. The new code counts it
after pushing the current IO onto the submission list. So the new way
is slightly more aggressive.
0006 "(read_stream: Only increase distance when waiting for IO)" looks
good to me from a code perspective. I don't yet have ideas for
handling potential parallel bitmapheapscan regressions.
Unfortunately with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is in
the kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.
So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.
Trigger which logic only for worker, you mean only increasing the
distance when waiting?
As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now using the distance to
control both. See 0008 for an attempt at splitting those concerns.
Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think). At least when I
was experimenting with this, the syscall overhead seemed
non-negligible. It's also true that this meant the memcpys couldn't be
parallelized, but system call overhead also seems to have been a
factor.
Setting aside more complicated prefetching systems, what it seems like
we are saying is that for all "miss" cases (not in SB) a distance of
above 1 is advantageous (unless we are only doing 1 IO). I wonder if
there is something hacky we can do like not decaying distance below
io_combine_limit if there has been a recent miss or growing it up to
at least io_combine_limit if we aren't getting all hits.
0007: Make read_stream_reset()/end() not wait for IO
This is a quite experimental, not really correct as-is, patch to avoid
unnecessarily waiting for in-flight IO when read_stream_reset() is done
while there's in-flight IO. This is useful for things like nestloop
antijoins with quals on the inner side (without the qual we'd not trigger
any readahead, as that's deferred in the index prefetching patch).
As-is this will leave IOs visible in pg_aios for a while, potentially
until the backends exit. That's not right.
Separating the problems: the handle slot exhaustion seems like it
could be solved by having the backend process discard IOs when it
needs one and there isn't any. Or is that not work we want to do in a
hot path?
The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED". But the buffers staying pinned is different. It seems
like you'll need the backend to process the discarded IOs at some
point. Maybe it should do that before idling waiting for input?
When discarding IOs, I don't really understand why the foreign IO
path, doesn't just clear its own wait ref (not the buffer descriptor
one) and move on -- instead you have it wait.
I haven't finished reviewing 0008 yet.
One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.
Agreed that something here would be useful.
- Melanie
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0008: WIP: read stream: Split decision about look ahead for AIO and combining
Until now read stream has used a single look-ahead distance to control
lookahead for both IO combining and read-ahead. That's sub-optimal, as we
want to do IO combining even when we don't need to do any readahead, as
avoiding the syscall overhead is important to reduce CPU overhead when
data is in the kernel page cache.
This is a prototype for what it could look like to split those
decisions. Thereby fixing the regression mentioned in 0006.
I wonder if we need to keep the combine_limit member in the read
stream. Could we just use io_combine_limit without ramping up and
down? This is mainly for code complexity reasons. Perhaps to allow
fast path reentry, we could use distance_decay_holdoff == 0 and
ios_in_progress == 0 instead of combine_distance == 0.
- Melanie
Hi,
On 2026-04-01 10:52:03 -0400, Melanie Plageman wrote:
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0008: WIP: read stream: Split decision about look ahead for AIO and combining
Until now read stream has used a single look-ahead distance to control
lookahead for both IO combining and read-ahead. That's sub-optimal, as we
want to do IO combining even when we don't need to do any readahead, as
avoiding the syscall overhead is important to reduce CPU overhead when
data is in the kernel page cache.
This is a prototype for what it could look like to split those
decisions. Thereby fixing the regression mentioned in 0006.
I wonder if we need to keep the combine_limit member in the read
stream. Could we just use io_combine_limit without ramping up and
down? This is mainly for code complexity reasons.
I thought so at first too, but it unfortunately leads to substantial
regressions with index prefetching, due to reading ahead unnecessarily far in
cases where we really just needed one block.
Perhaps to allow fast path reentry, we could use distance_decay_holdoff == 0
and ios_in_progress == 0 instead of combine_distance == 0.
Somewhat orthogonal: I really dislike the code to re-enter fastpath. I've now
broken it a few times without noticing. Especially when using a lower
distance, it's easy for the gating conditions to be fulfilled if
read_stream_look_ahead() decided to not *yet* do look ahead, because there's
still a pinned buffer and the distance is low.
ISTM that it really should only be checked after we did a lookahead and found
it to be a buffer hit.
Greetings,
Andres Freund
Hi,
On 2026-03-31 16:59:14 -0400, Melanie Plageman wrote:
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0005+0006: Only increase distance when waiting for IO
In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.
Oops.
Is there any reason for both the io size and inflight IOs threshold to
be 4? If they should be the same, I think it would be better if this
was a macro.
No, there's no real reason they have to be the same. It's just what a bit of
experimenting showed working independently for either.
This may not matter, but the old code checked in_flight_before > 5
before incrementing if for the current IO. The new code counts it
after pushing the current IO onto the submission list. So the new way
is slightly more aggressive.
Hm. True. Not sure it matters. I didn't really see a significant difference
for anything between 3 and 7, it was only outside of that that I saw worse
performance.
I have done more validation with the new cutoff value than with the old one,
so I'm ever so mildly inclined to use the value currently in the patch, but I
won't at all insist on it.
Unfortuntely with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is the
kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.
So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.
Trigger which logic only for worker, you mean only increasing the
distance when waiting?
Yea.
As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now the distance is used to
control both. See 0008 for an attempt at splitting those concerns.
Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think).
I don't think it actually does change the situation around that significantly,
because we already end up with "too few" IOs once we hit the distance maximum,
as we'll submit another IO as soon as we can.
I think we will eventually need some logic to only start submitting again once
multiple IOs are possible. But that's another set of heuristics, so a project
for another day :)
In my experiments the batching was primarily useful to allow to reduce the
command submission & interrupt overhead when interacting with storage,
i.e. when actually doing IO, not just copying from the page cache.
I have seen it help due to reducing syscalls too, but the amount of batching
and/or combining seems to have a relatively low ceiling at which it stops
helping.
Setting aside more complicated prefetching systems, what it seems like
we are saying is that for all "miss" cases (not in SB) a distance of
above 1 is advantageous (unless we are only doing 1 IO). I wonder if
there is something hacky we can do like not decaying distance below
io_combine_limit if there has been a recent miss or growing it up to
at least io_combine_limit if we aren't getting all hits.
I think it's true that if IO execution was all that mattered, we would want a
bit more IO in flight at all times. However, looking ahead quite deeply also
has costs:
1) The resowner and private refcount mechanisms take more CPU cycles if you
have more buffers pinned
2) The CPU cache hit ratio goes down if there's a longer time between copying
data into s_b and consuming it
3) If you have a scan that won't be consumed to completion, you're wasting
more the deeper you look ahead
This is actually not hard to show:
SET max_parallel_workers_per_gather = 0;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts WHERE abalance = 3;
(on a tree that has the explain stuff, but not the patches in this thread applied)
eic time in ms
1 1326.525
2 1325.240
4 1335.073
8 1343.440
16 1346.189
32 1356.598
64 1398.326
128 1635.081
256 1674.685
512 1677.264
1000 1680.050
This one mainly shows 2) from above, I think, but the others are measurable in
other workloads.
0007: Make read_stream_reset()/end() not wait for IO
This is a quite experimental, not really correct as-is, patch to avoid
unnecessarily waiting for in-flight IO when read_stream_reset() is done
while there's in-flight IO. This is useful for things like nestloop
antijoins with quals on the inner side (without the qual we'd not trigger
any readahead, as that's deferred in the index prefetching patch).
As-is this will leave IOs visible in pg_aios for a while, potentially
until the backends exit. That's not right.
Separating the problems: the handle slot exhaustion seems like it
could be solved by having the backend process discard IOs when it
needs one and there isn't any.
I don't think it could lead to exhaustion of handles, pgaio_io_acquire() will
call pgaio_io_wait_for_free() which will wait for the oldest IO.
The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED". But the buffers staying pinned is different. It seems
like you'll need the backend to process the discarded IOs at some
point. Maybe it should do that before idling waiting for input?
The easiest way would be to actually leave the IOs registered with the
resowner and have it wait for completion at command or transaction end if not
already done. But we currently don't really do that with resowners in the
!abort case. I'm not sure if anybody would mind doing differently here.
Another approach would be to do it in AtEOXact_Aio(), but that would mean the
IOs could hang around for a while.
A third approach could be to do one of the above, but add some additional
logic that goes through in-flight IOs and checks if they completed, e.g. in
pgaio_io_acquire_nb().
When discarding IOs, I don't really understand why the foreign IO
path, doesn't just clear its own wait ref (not the buffer descriptor
one) and move on -- instead you have it wait.
That'd be ok to do, I just didn't want to think about that more complicated
and less common case :)
I haven't finished reviewing 0008 yet.
I've attached a version of what was 0008 split into two, one to introduce the
new helpers, one to introduce the separate combine distance.
I've pushed what was 0001 and 0002. Will push the former 0003 shortly.
Greetings,
Andres Freund
Attachments:
v4-0001-read_stream-Issue-IO-synchronously-while-in-fast-.patch (+21/-1)
v4-0002-read_stream-Prevent-distance-from-decaying-too-qu.patch (+33/-4)
v4-0003-aio-io_uring-Trigger-async-processing-for-large-I.patch (+68/-23)
v4-0004-read_stream-Only-increase-distance-when-waiting-f.patch (+34/-8)
v4-0005-WIP-read_stream-Move-logic-about-IO-combining-iss.patch (+82/-18)
v4-0006-WIP-read-stream-Split-decision-about-look-ahead-f.patch (+115/-33)
v4-0007-Hacky-implementation-of-making-read_stream_reset-.patch (+55/-9)
Hi,
On Thu, 2 Apr 2026 at 02:20, Andres Freund <andres@anarazel.de> wrote:
I've pushed what was 0001 and 0002. Will push the former 0003 shortly.
0001 LGTM.
0002: read_stream: Prevent distance from decaying too quickly
+ /*
+ * As we needed IO, prevent distance from being reduced within our
+ * maximum look-ahead window. This avoids having distance collapse too
+ * quickly in workloads where most of the required blocks are cached,
+ * but where the remaining IOs are a sufficient enough factor to cause
+ * a substantial slowdown if executed synchronously.
+ *
+ * There are valid arguments for preventing decay for max_ios or for
+ * max_pinned_buffers. But the argument for max_pinned_buffers seems
+ * clearer - if we can't see any misses within the maximum look-ahead
+ * distance, we can't do any useful read-ahead.
+ */
+ stream->distance_decay_holdoff = stream->max_pinned_buffers;
That is already committed but I have a question. Did you think about
setting stream->distance_decay_holdoff to current stream->distance?
This will also fix 'miss followed by a hit' (it won't fix double
missed followed by a hit, though).
0003: aio: io_uring: Trigger async processing for large IOs
There is a small optimization opportunity, we don't need to calculate
io_size for the DIO since pgaio_uring_should_use_async() will always
return false. Do you think it is worth implementing this? Other than
that, LGTM.
0004: read_stream: Only increase distance when waiting for IO
LGTM.
0005: WIP: read_stream: Move logic about IO combining & issuing to helpers
/* never pin more buffers than allowed */
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
return false;
/*
* Don't start more read-ahead if that'd put us over the distance limit
* for doing read-ahead.
*/
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->distance)
return false;
AFAIK, stream->distance can't be higher than
stream->max_pinned_buffers [1], so I think we can remove the first if
block. If we are not sure about [1], stream->max_pinned_buffers most
likely will be higher than stream->distance, it would make sense to
change the order of these.
Aha, I understood this change after looking at 0006. It is a nitpick but
'if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)' block can be added in 0006. Other than
these, LGTM.
0006: WIP: read stream: Split decision about look ahead for AIO and combining
I liked the idea of being more aggressive to do IO combining. What is
the reason for gradually increasing combine_distance, is it to not do
unnecessary IOs at the start?
+ /*
+ * XXX: Should we actually reduce this at any time other than
+ * a reset? For now we have to, as this is also a condition
+ * for re-enabling fast_path.
+ */
+ if (stream->combine_distance > 1)
+ stream->combine_distance--;
I don't think we need to reduce this other than reset.
+ /*
+ * Allow looking further ahead if we are in the process of building a
+ * larger IO, the IO is not yet big enough and we don't yet have IO in
+ * flight. Note that this is allowed even if we are reaching the
+ * read-ahead limit (but not the buffer pin limit).
+ *
+ * This is important for cases where either effective_io_concurrency is
+ * low or we never need to wait for IO and thus are not increasing the
+ * distance. Without this we would end up with lots of small IOs.
+ */
+ if (stream->pending_read_nblocks > 0 &&
+ stream->pinned_buffers == 0 &&
+ stream->pending_read_nblocks < stream->combine_distance)
+ return true;
Do we need to check 'stream->pinned_buffers == 0' here? I think we can
continue to look ahead although there are pinned buffers as we already
know 'stream->pinned_buffers + stream->pending_read_nblocks <
stream->max_pinned_buffers'.
+ /* same if capped not by io_combine_limit but combine_distance */
+ if (stream->combine_distance > 0 &&
+ pending_read_nblocks >= stream->combine_distance)
+ return true;
I think we don't need to check 'stream->combine_distance > 0', it
shouldn't be less or equal than zero when the
'stream->readahead_distance' is not zero, right?
+ {
+ stream->readahead_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+ stream->combine_distance = stream->io_combine_limit;
+ }
I think the stream->combine_distance should be equal to
stream->readahead_distance as we don't want to surpass
max_pinned_buffers.
0007: Hacky implementation of making read_stream_reset()/end() not wait for IO
Read stream parts LGTM. I don't have a comment on the FIXME part, though.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On 2026-04-02 15:30:26 +0300, Nazir Bilal Yavuz wrote:
0002: read_stream: Prevent distance from decaying too quickly
+ /*
+ * As we needed IO, prevent distance from being reduced within our
+ * maximum look-ahead window. This avoids having distance collapse too
+ * quickly in workloads where most of the required blocks are cached,
+ * but where the remaining IOs are a sufficient enough factor to cause
+ * a substantial slowdown if executed synchronously.
+ *
+ * There are valid arguments for preventing decay for max_ios or for
+ * max_pinned_buffers. But the argument for max_pinned_buffers seems
+ * clearer - if we can't see any misses within the maximum look-ahead
+ * distance, we can't do any useful read-ahead.
+ */
+ stream->distance_decay_holdoff = stream->max_pinned_buffers;
That is already committed but I have a question. Did you think about
setting stream->distance_decay_holdoff to current stream->distance?
This will also fix 'miss followed by a hit' (it won't fix double
missed followed by a hit, though).
I am fairly certain that would not work well, because the whole point of the
mechanism is to prevent the distance from staying too small when the distance
is low.
We want the distance to grow for cases like miss, hit{10}, miss, ... because
that will typically still be bottlenecked by IO latency. And something like
setting the decay holdoff to the current distance wouldn't work for that.
0003: aio: io_uring: Trigger async processing for large IOs
There is a small optimization opportunity, we don't need to calculate
io_size for the DIO since pgaio_uring_should_use_async() will always
return false. Do you think it is worth implementing this? Other than
that, LGTM.
I doubt it. Compared to actually doing IO the cost of this is negligible.
0005: WIP: read_stream: Move logic about IO combining & issuing to helpers
/* never pin more buffers than allowed */
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
return false;

/*
 * Don't start more read-ahead if that'd put us over the distance limit
 * for doing read-ahead.
 */
if (stream->pinned_buffers + stream->pending_read_nblocks >=
    stream->distance)
    return false;

AFAIK, stream->distance can't be higher than
stream->max_pinned_buffers [1], so I think we can remove the first if
block. If we are not sure about [1], stream->max_pinned_buffers most
likely will be higher than stream->distance, so it would make sense to
change the order of these.

Aha, I understood this change after looking at 0006. It is a nitpick, but
'if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)' block can be added in 0006. Other than
these, LGTM.
Hm. I guess that could make sense.
0006: WIP: read stream: Split decision about look ahead for AIO and combining
I liked the idea of being more aggressive to do IO combining. What is
the reason for gradually increasing combine_distance, is it to not do
unnecessary IOs at the start?
Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
hurt with index scan prefetching, because of query plans where we only consume
part of the index scan (e.g. a nested loop antijoin).
+ /*
+  * XXX: Should we actually reduce this at any time other than
+  * a reset? For now we have to, as this is also a condition
+  * for re-enabling fast_path.
+  */
+ if (stream->combine_distance > 1)
+     stream->combine_distance--;

I don't think we need to reduce this other than reset.
Hm. I go back and forth on that one :)
+ /*
+  * Allow looking further ahead if we have an the process of building a
+  * larger IO, the IO is not yet big enough and we don't yet have IO in
+  * flight. Note that this is allowed even if we are reaching the
+  * read-ahead limit (but not the buffer pin limit).
+  *
+  * This is important for cases where either effective_io_concurrency is
+  * low or we never need to wait for IO and thus are not increasing the
+  * distance. Without this we would end up with lots of small IOs.
+  */
+ if (stream->pending_read_nblocks > 0 &&
+     stream->pinned_buffers == 0 &&
+     stream->pending_read_nblocks < stream->combine_distance)
+     return true;

Do we need to check 'stream->pinned_buffers == 0' here? I think we can
continue to look ahead although there are pinned buffers as we already
know 'stream->pinned_buffers + stream->pending_read_nblocks <
stream->max_pinned_buffers'.
We could, but I don't think there would be a benefit in doing so. In my mind,
what combine_distance is used for, is to control how far to look ahead to
allow for IO combining when the readahead_distance is too low to allow for IO
combining. But if pinned_buffers > 0, we already have another IO in flight,
so the combine_distance mechanism doesn't have to kick in.
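Condensing the conditions discussed in this exchange, a toy C model of the look-ahead predicate might look as follows. LookaheadModel and toy_should_look_ahead are made up for illustration; the field names follow the quoted snippets, but this is not the actual read_stream.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the read stream state relevant to looking ahead. */
typedef struct
{
	int		pinned_buffers;
	int		pending_read_nblocks;
	int		max_pinned_buffers;
	int		readahead_distance;	/* in blocks */
	int		combine_distance;	/* in blocks */
} LookaheadModel;

static bool
toy_should_look_ahead(LookaheadModel *s)
{
	/* never pin more buffers than allowed */
	if (s->pinned_buffers + s->pending_read_nblocks >= s->max_pinned_buffers)
		return false;

	/*
	 * Keep growing a combined IO while nothing is in flight, even past the
	 * read-ahead distance (but never past the pin limit).
	 */
	if (s->pending_read_nblocks > 0 &&
		s->pinned_buffers == 0 &&
		s->pending_read_nblocks < s->combine_distance)
		return true;

	/* otherwise, stop at the read-ahead distance limit */
	if (s->pinned_buffers + s->pending_read_nblocks >= s->readahead_distance)
		return false;

	return true;
}
```

In this model the combine-distance exception only fires with no IO in flight, which mirrors the rationale above: once pinned_buffers > 0, another IO is already underway and the combine mechanism doesn't need to kick in.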
I've been experimenting with going the other way, by having
read_stream_should_look_ahead() do a check like
/*
* If we already have IO in flight, but are close enough to the
* distance limit that we would not start a fully sized IO, don't even
* start a pending read until later.
*
* This avoids calling the next block callback in cases where we
* would not start the pending read anyway. For some users the work to
* determine the next block is non-trivial, so we don't want to do so
* earlier than necessary.
*
* A secondary benefit of this is that some callers use parallel workers
* with each their own read stream to process a global list of blocks, and
* only calling the next block callback when ready to actually issue IO
* makes it more likely for one backend to get consecutive blocks.
*/
if (stream->pinned_buffers > 0 &&
stream->pending_read_nblocks == 0 &&
stream->pinned_buffers + stream->combine_distance >= stream->readahead_distance)
return false;
I do see that improve performance with some bitmap heap scan queries
(e.g. Tomas' "cyclic" ones). But I'm worried it'd hurt other queries when
using a low effective_io_concurrency. So that's probably something for later.
+ /* same if capped not by io_combine_limit but combine_distance */
+ if (stream->combine_distance > 0 &&
+     pending_read_nblocks >= stream->combine_distance)
+     return true;

I think we don't need to check 'stream->combine_distance > 0', it
shouldn't be less than or equal to zero when
'stream->readahead_distance' is not zero, right?
Yea. I guess we could just assert that.
+ {
+     stream->readahead_distance = Min(max_pinned_buffers,
+                                      stream->io_combine_limit);
+     stream->combine_distance = stream->io_combine_limit;
+ }

I think the stream->combine_distance should be equal to
stream->readahead_distance as we don't want to surpass
max_pinned_buffers.
I don't think it could lead to exceeding max_pinned_buffers, due to the
/* never pin more buffers than allowed */
check in read_stream_should_look_ahead(), but there's no reason to rely on
that.
Greetings,
Andres Freund
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0005+0006: Only increase distance when waiting for IO
Until now we have increased the read ahead distance whenever we
needed to do IO (doubling the distance every miss). But that will often be
way too aggressive, with the IO subsystem being able to keep up with a
much lower distance.

The idea here is to use information about whether we needed to wait for IO
before returning the buffer in read_stream_next_buffer() to control
whether we should increase the readahead distance.

This seems to work extremely well for worker.
Unfortunately with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is in the
kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.
On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases—say, by moving to IO worker threads—we won't
be able to rely on this to keep the distance to an advantageous level.
If io_uring async copying is advantageous even when the consumer never
needs to wait, then it seems like parallelizing copying to/from the
kernel buffer cache will always be advantageous to do at some level.
The case where it is not (as you've stated before) is when the
consumer doesn't need the extra blocks, so it is just wasted time
spent acquiring them.
So, it feels odd to try and find a heuristic that allows the readahead
distance to increase even when the consumer is not having to wait. I'm
not saying we should do this for this release, but I'm just wondering
if in the medium term, we should try to find a better way to identify
the situation where async processing is not beneficial because the
blocks won't be needed.
So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.

As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now the distance is used to
control both. See 0008 for an attempt at splitting those concerns.
Yea, I think running ahead far enough to get bigger IOs needs to
happen and can't be based on the consumer having to wait.
- Melanie
Hi,
On 2026-04-02 10:31:50 -0400, Melanie Plageman wrote:
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
0005+0006: Only increase distance when waiting for IO
Until now we have increased the read ahead distance whenever we
needed to do IO (doubling the distance every miss). But that will often be
way too aggressive, with the IO subsystem being able to keep up with a
much lower distance.

The idea here is to use information about whether we needed to wait for IO
before returning the buffer in read_stream_next_buffer() to control
whether we should increase the readahead distance.

This seems to work extremely well for worker.

Unfortunately with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is in the
kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.

On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases—say, by moving to IO worker threads—we won't
be able to rely on this to keep the distance to an advantageous level.
I don't see why lower overhead would prevent this from working?
If io_uring async copying is advantageous even when the consumer never
needs to wait, then it seems like parallelizing copying to/from the
kernel buffer cache will always be advantageous to do at some level.
It's not universally advantageous, unfortunately - there's a nontrivial
increase in latency (and also some CPU) due to it. Which matters mostly when
having a shallow look-ahead depth (like at the start of a stream), because
then the latency impact will directly influence query performance.
Setup:
CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, 500)) AS pattern;
workload:
SET io_combine_limit = 1;
SET effective_io_concurrency=1;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
(and then repeated for eic 2,4,8,16)
eic   plain (ms)   forced async (ms)
1        2.331        5.366
2        2.164        3.210
4        2.151        2.677
8        2.155        2.749
16       2.151        2.742
32       2.141        2.732
64       2.161        2.739
128      2.153        2.652
Note that forced async never quite catches up.
If I instead make the pattern 50k blocks long:
eic   plain (ms)   forced async (ms)
1      210.678      454.132
2      209.210      281.452
4      208.775      198.496
8      208.755      198.131
16     209.477      195.799
32     203.497      183.297
64     203.002      173.799
128    202.885      166.548
The case where it is not (as you've stated before) is when the
consumer doesn't need the extra blocks, so it is just wasted time
spent acquiring them.
That's one reason, but as shown above, it's also that the increase in latency
can hurt, particularly in the first few blocks (where we are ramping up the
distance) and when effective_io_concurrency is too low to allow for a deep
enough read-ahead to hide the latency increase.
So, it feels odd to try and find a heuristic that allows the readahead
distance to increase even when the consumer is not having to wait.
Do you still feel like that with the added context from the above?
I'm not saying we should do this for this release, but I'm just wondering if
in the medium term, we should try to find a better way to identify the
situation where async processing is not beneficial because the blocks won't
be needed.
I think we certainly can do better than today with some help, e.g. from the
planner, to identify cases where we should be more careful about reading ahead
too far, e.g. due to being on the inner side of a nestloop antijoin.
So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.

As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now the distance is used to
control both. See 0008 for an attempt at splitting those concerns.

Yea, I think running ahead far enough to get bigger IOs needs to
happen and can't be based on the consumer having to wait.
What do you think about the updated patch to achieve that that I posted?
Greetings,
Andres Freund
On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases—say, by moving to IO worker threads—we won't
be able to rely on this to keep the distance to an advantageous level.

I don't see why lower overhead would prevent this from working?
needed_wait has to be true to increase the readahead distance and for
io_uring, when data was in the kernel buffer cache, needed_wait is
false, meaning the distance doesn't increase. Worker mode didn't have
this problem because of overhead. So needed_wait is true for workers.
But, now that we will have combine_distance, I guess we don't need to
rely on workers having overhead. So we are saying that
readahead_distance is completely irrelevant for copying from the
kernel buffer cache and only combine_distance matters for that now,
right?
Yea, I think running ahead far enough to get bigger IOs needs to
happen and can't be based on the consumer having to wait.

What do you think about the updated patch to achieve that that I posted?
Will post separately.
- Melanie
Hi,
On 2026-04-02 17:13:34 -0400, Melanie Plageman wrote:
On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases—say, by moving to IO worker threads—we won't
be able to rely on this to keep the distance to an advantageous level.

I don't see why lower overhead would prevent this from working?
needed_wait has to be true to increase the readahead distance and for
io_uring, when data was in the kernel buffer cache, needed_wait is
false, meaning the distance doesn't increase. Worker mode didn't have
this problem because of overhead. So needed_wait is true for workers.
But, now that we will have combine_distance, I guess we don't need to
rely on workers having overhead.
I think we still do, but that it will continue to work, even if the overhead
is much smaller than today. The workers will complete the IOs only after the
memory copy is finished (duh). Therefore, if the distance is too small to
allow workers to complete the copy, the distance will be increased, due to
needed_wait.
So we are saying that readahead_distance is completely irrelevant for
copying from the kernel buffer cache and only combine_distance matters for
that now, right?
I don't think so! The combine_distance thing is crucial to allow for IO
combining, and, indirectly, for triggering the size based "async" heuristics
with io_uring. Once the io_uring async heuristic is triggered, the needed_wait
mechanism works to further increase the distance.
That does mean that for random BLCKSZ sized IOs that are in the page cache the
async mechanism won't typically be triggered - but from what I can tell that's
ok, because lots of 8kB IOs is also where the dispatch overhead to the kernel
threads doing the copying is the highest.
Greetings,
Andres Freund
On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
What do you think about the updated patch to achieve that that I posted?
here is some review on 0005 and 0006 earlier posted
concrete things:
-------
- I’d reorder the stream->distance == 0 check in read_stream_look_ahead() (and
issue); I found myself asking why it wasn’t first
- I agree with bilal that
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
return false;
should be in the other commit because it isn’t required with the
current code because of how distance is set and is more confusing.
wasn’t it an assert before?
- perhaps the new heuristic for allowing further look ahead if we are
building an IO and it isn’t big enough yet should be in its own commit
- why have read_stream_pause() save combine_distance if it isn’t going
to zero it out?
- I think the comments you added to the read stream struct for
combine_distance and readahead_distance should indicate units (i.e.
blocks)
- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)
- can we move the fast path to where we found out we got a hit?
- should_issue_now has a lot of overlap with should_read_ahead which
makes it confusing to figure out how they are different, but I think
that is related to readahead_distance being in units of blocks and not
IOs (which I talk about later)
- There's also some assorted typos and comment awkwardness that I
assume you will clean up in the polish step
(e.g. "if we have an the process" -> "if we are in the process",
"oveflow" -> "overflow", The NB comment still says stream->distance ==
0 but should say stream->readahead_distance == 0)
some more ambiguous stuff:
-------
In my experiments the batching was primarily useful to reduce the
command submission & interrupt overhead when interacting with storage, i.e.
when actually doing IO, not just copying from the page cache.
It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache because if
the IO is quick, the same worker may pick up the next IO.
---
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:
The resowner and private refcount mechanisms take more CPU cycles if you have
more buffers pinned
But I don't see how either distance is responding to this or
self-calibrating with this in mind
---
I liked the idea of being more aggressive to do IO combining. What is
the reason for gradually increasing combine_distance, is it to not do
unnecessary IOs at the start?

Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
hurt with index scan prefetching, because of query plans where we only consume
part of the index scan (e.g. a nested loop antijoin).
You said we have to ramp up combine_distance because in index
prefetching when we don’t need all the blocks, it can hurt. But if we
ramp it up unconditionally (assuming we did IO), then I don’t see how
this would solve the problem as it will ramp up regardless after just
a few IOs anyway.
---
I've been experimenting with going the other way, by having
read_stream_should_look_ahead() do a check like
/*
* If we already have IO in flight, but are close enough to the
* distance limit that we would not start a fully sized IO, don't even
* start a pending read until later.
*
* This avoids calling the next block callback in cases where we
* would not start the pending read anyway. For some users the work to
* determine the next block is non-trivial, so we don't want to do so
* earlier than necessary.
*
* A secondary benefit of this is that some callers use parallel workers
* with each their own read stream to process a global list of blocks, and
* only calling the next block callback when ready to actually issue IO
* makes it more likely for one backend to get consecutive blocks.
*/
if (stream->pinned_buffers > 0 &&
stream->pending_read_nblocks == 0 &&
stream->pinned_buffers + stream->combine_distance >=
stream->readahead_distance)
return false;
So, as you say, with, for example, a large io_combine_limit and small
effective_io_concurrency, this would result in much lower IO
concurrency than we want. But, I think this speaks to the central
tension I see with the new combine_distance and readahead_distance: it
seems like readahead_distance should now be in units of IO and
combine_distance in units of blocks. If that were the case, this
heuristic wouldn't have a downside.
Obviously ramping readahead_distance up and down when it is in units
of IO becomes much more fraught though.
But our criteria for wanting to make bigger IOs is different than our
criteria for wanting more IOs in flight.
---
I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:
We could, but I don't think there would be a benefit in doing so. In my mind,
what combine_distance is used for, is to control how far to look ahead to
allow for IO combining when the readahead_distance is too low to allow for IO
combining. But if pinned_buffers > 0, we already have another IO in flight,
so the combine_distance mechanism doesn't have to kick in.
But it seems like for a completely uncached workload bigger IOs is
still beneficial. We may want to avoid reading ahead to get bigger IOs
when there is a chance we only need a few blocks. And, we want to
parallelize copies across workers, so we want a combine_distance that
is small enough that, relative to effective_io_concurrency, we can
split IOs across workers some. But, otherwise, we want maximally sized
IOs. In the medium term, perhaps we can try to use executor hints to
handle the former. And, assuming you think readahead_distance in IOs
is a non-starter, maybe we could use a heuristic that tries to balance
IOs by dividing readahead_distance by combine_distance when
combine_distance is sufficiently high or something like that.
- Melanie
On Thu, Apr 2, 2026 at 9:33 AM Andres Freund <andres@anarazel.de> wrote:
+ /*
+  * XXX: Should we actually reduce this at any time other than
+  * a reset? For now we have to, as this is also a condition
+  * for re-enabling fast_path.
+  */
+ if (stream->combine_distance > 1)
+     stream->combine_distance--;

I don't think we need to reduce this other than reset.
Hm. I go back and forth on that one :)
Separate from the fast-path enablement, we also probably want to
decrease combine distance when we decrease readahead_distance because
there is a point where we still want to parallelize the IOs even when
the distance is lower and to do that, we need to make smaller IOs. I'm
not sure where this point is, but I wonder if a few 256kB IOs are
faster than one 1MB IO (could test that with fio actually). I imagine
that there is some size where that is true because of peculiarities in
how drives (and cloud storage) issue/break up IOs after they are a
certain size, etc.
- Melanie
Hi,
On 2026-04-03 11:46:59 -0400, Melanie Plageman wrote:
On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
What do you think about the updated patch to achieve that that I posted?
here is some review on 0005 and 0006 earlier posted
concrete things:
-------
- I’d reorder the stream->distance == 0 check in read_stream_look_ahead() (and
issue); I found myself asking why it wasn’t first

- I agree with bilal that
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
return false;
should be in the other commit because it isn’t required with the
current code because of how distance is set and is more confusing.
wasn’t it an assert before?
Will do.
- perhaps the new heuristic for allowing further look ahead if we are
building an IO and it isn’t big enough yet should be in its own commit
Isn't that the whole point of the commit? What else is there?
- why have read_stream_pause() save combine_distance if it isn’t going
to zero it out?
It should zero it out.
- I think the comments you added to the read stream struct for
combine_distance and readahead_distance should indicate units (i.e.
blocks)
Will do.
- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)
Because entering the fast path prevents IO combining from happening. So it's
absolutely crucial that we do *not* enter it while we would combine.
- can we move the fast path to where we found out we got a hit?
Yea, I think we should. Seems like a separate project though.
- should_issue_now has a lot of overlap with should_read_ahead which
makes it confusing to figure out how they are different
Yea. I'm not sure how to do it better though.
but I think that is related to readahead_distance being in units of blocks
and not IOs (which I talk about later)
I agree that we should change that, but it seems like a separate project to
me.
- There's also some assorted typos and comment awkwardness that I
assume you will clean up in the polish step
(e.g. "if we have an the process" -> "if we are in the process",
"oveflow" -> "overflow", The NB comment still says stream->distance ==
0 but should say stream->readahead_distance == 0)
Yea.
some more ambiguous stuff:
-------

In my experiments the batching was primarily useful to reduce the
command submission & interrupt overhead when interacting with storage, i.e.
when actually doing IO, not just copying from the page cache.

It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache because if
the IO is quick, the same worker may pick up the next IO.
When you say workers, do you mean for io_method=worker? Or the internal kernel
threads of io_uring for async IOs?
---
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:

The resowner and private refcount mechanisms take more CPU cycles if you have
more buffers pinned

But I don't see how either distance is responding to this or
self-calibrating with this in mind.
Using the minimal required distance to avoid needing to wait for IO completion
is responding to that, no? Without these patches we read ahead as far as
possible, even if all the data is in the page cache, which makes this issue
way worse (without these patches it's a major source of regressions in the
index prefetching patch).
---
I liked the idea of being more aggressive to do IO combining. What is
the reason for gradually increasing combine_distance, is it to not do
unnecessary IOs at the start?

Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
hurt with index scan prefetching, because of query plans where we only consume
part of the index scan (e.g. a nested loop antijoin).

You said we have to ramp up combine_distance because in index
prefetching, when we don't need all the blocks, it can hurt. But if we
ramp it up unconditionally (assuming we did IO), then I don't see how
this would solve the problem, as it will ramp up regardless after just
a few IOs anyway.
It limits the percentage of "wasted IO". So it does not entirely solve it - I
think that's impossible - but reduces it enough to be ok.
If the consumer stops consuming after e.g. 1 IO, an immediate rampup would
lead to 15 blocks being wasted (assuming io_combine_limit 16). So only ~6% of
the blocks read in are used.
If instead the consumer stops only after already having ramped up to the max
combine distance, it'd have to have consumed a lot more buffers, as the
distance is increased only after consuming a buffer that required IO.
Think it's something like:
io of size 1, block 0
consume buffer #1 -> combine_distance 2
io of size 2, block 1-2
consume buffer #2 -> combine_distance 4
consume buffer #3
io of size 4, block 3-6
consume buffer #4 -> combine_distance 8
consume buffer #5-#7
io of size 8, block 7-14
consume buffer #8 -> combine_distance 16
consume buffer #9-#15
io of size 16, block 15-30
consume buffer #16 -> combine distance can't be increased further
<reset>
Consumer consumed 16 buffers, we did IO for 31 blocks, wasted 15. So ~51%
of the blocks read in are used.
While 51% is not perfect, it is still a considerable improvement in worst-case
wastage. It's also less of a problem because more CPU was spent in the
consumer, given that 16 buffers had to be processed to get to this point.
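As a sanity check of the arithmetic in that trace, here is a small standalone C sketch; ramp_used_percent() is invented for illustration and simply replays the doubling pattern above:

```c
#include <assert.h>

/*
 * Model of the ramp-up trace: combined IO sizes double (1, 2, 4, ...) up to
 * io_combine_limit, and the consumer stops right after consuming the first
 * buffer of the final, fully sized IO.
 */
static int
ramp_used_percent(int io_combine_limit, int *blocks_read, int *blocks_used)
{
	int		total = 0;
	int		used = 0;
	int		size = 1;

	while (size < io_combine_limit)
	{
		total += size;			/* one IO of the current combine size */
		used += size;			/* fully consumed; the distance doubles */
		size *= 2;
	}
	total += io_combine_limit;	/* the final, fully sized IO... */
	used += 1;					/* ...of which only one buffer gets used */

	if (blocks_read)
		*blocks_read = total;
	if (blocks_used)
		*blocks_used = used;
	return used * 100 / total;
}
```

With io_combine_limit = 16 this yields 31 blocks read, 16 consumed, i.e. ~51% used, versus ~6% (1 of 16) for an immediate ramp-up abandoned after one buffer.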
---
I've been experimenting with going the other way, by having
read_stream_should_look_ahead() do a check like

/*
 * If we already have IO in flight, but are close enough to the
* distance limit that we would not start a fully sized IO, don't even
* start a pending read until later.
*
* This avoids calling the next block callback in cases where we
* would not start the pending read anyway. For some users the work to
* determine the next block is non-trivial, so we don't want to do so
* earlier than necessary.
*
* A secondary benefit of this is that some callers use parallel workers
* with each their own read stream to process a global list of blocks, and
* only calling the next block callback when ready to actually issue IO
* makes it more likely for one backend to get consecutive blocks.
*/
if (stream->pinned_buffers > 0 &&
stream->pending_read_nblocks == 0 &&
stream->pinned_buffers + stream->combine_distance >=
stream->readahead_distance)
return false;

So, as you say, with, for example, a large io_combine_limit and small
effective_io_concurrency, this would result in much lower IO
concurrency than we want. But, I think this speaks to the central
tension I see with the new combine_distance and readahead_distance: it
seems like readahead_distance should now be in units of IO and
combine_distance in units of blocks. If that were the case, this
heuristic wouldn't have a downside.

Obviously ramping readahead_distance up and down when it is in units
of IO becomes much more fraught though.
I think we might want to move to that world, but I think that'll require a lot
more work to validate. In cases where you have a mix of hits and misses, with
the misses requiring actual IO, we'd end up with a lot less readahead than
today.
---
I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:

We could, but I don't think there would be a benefit in doing so. In my mind,
what combine_distance is used for, is to control how far to look ahead to
allow for IO combining when the readahead_distance is too low to allow for IO
combining. But if pinned_buffers > 0, we already have another IO in flight,
so the combine_distance mechanism doesn't have to kick in.

But it seems like for a completely uncached workload bigger IOs are
still beneficial.
Massively so - the storage layer getting too small IOs really hurts.
But, as the code stands, we *do* end up with large IOs in that case, because
we will not issue the IO until it is "complete". If we need to do actual IO,
the readahead_distance will be larger, and allow multiple full sized IOs to be
issued.
/*
 * We don't start the pending read just because we've hit the distance limit,
 * preferring to give it another chance to grow to full io_combine_limit size
 * once more buffers have been consumed. But this is not desirable in all
 * situations - see below.
 */
static inline bool
read_stream_should_issue_now(ReadStream *stream)
{
	...

	/*
	 * If we've already reached io_combine_limit, there's no chance of growing
	 * the read further.
	 */
	if (pending_read_nblocks >= stream->io_combine_limit)
		return true;

	/* same if capped not by io_combine_limit but combine_distance */
	if (stream->combine_distance > 0 &&
		pending_read_nblocks >= stream->combine_distance)
		return true;

	/*
	 * If we currently have no reads in flight or prepared, issue the IO once
	 * we are not looking ahead further. This ensures there's always at least
	 * one IO prepared.
	 */
	if (stream->pinned_buffers == 0 &&
		!read_stream_should_look_ahead(stream))
		return true;

	return false;
}
So, unless we are not reading ahead, we are not issuing IOs until they are
fully sized (or can't be combined, obviously).
Am I misunderstanding what you're talking about?
Greetings,
Andres Freund
Hi,
On 2026-04-03 12:45:50 -0400, Melanie Plageman wrote:
On Thu, Apr 2, 2026 at 9:33 AM Andres Freund <andres@anarazel.de> wrote:
+		/*
+		 * XXX: Should we actually reduce this at any time other than
+		 * a reset? For now we have to, as this is also a condition
+		 * for re-enabling fast_path.
+		 */
+		if (stream->combine_distance > 1)
+			stream->combine_distance--;
I don't think we need to reduce this other than reset.
Hm. I go back and forth on that one :)
Separate from the fast-path enablement, we also probably want to
decrease combine distance when we decrease readahead_distance because
there is a point where we still want to parallelize the IOs even when
the distance is lower and to do that, we need to make smaller IOs.
I'm not sure that's something we really need to worry about at this point. If
readahead_distance is so small that it does not allow enough IO concurrency,
we will have to wait for IO completion, which in turn will lead to the
readahead distance being increased again.
I can see some corner cases where this would not suffice, e.g. if you have a
rather low pin limit, but I doubt those are relevant in practice?
I'm not sure where this point is, but I wonder if a few 256kB IOs is faster
than 1 1MB IO (could test that with fio actually).
Yes, that point definitely exists. But I think the mechanism for that is to
configure io_combine_limit at or below the threshold at which even bigger IOs
hurt.
I imagine that there is some size where that is true because of
peculiarities in how drives (and cloud storage) issue/break up IOs after
they are a certain size, etc.
It's even true for synchronous copies from the kernel page cache, due to some
hardware issue I have yet to fully understand. On both Intel and AMD CPUs,
unless SMAP is disabled, larger copies from kernel to userspace start to be
substantially slower, somewhere around 1-4MBs per IO.
Greetings,
Andres Freund
On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <andres@anarazel.de> wrote:
- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)
Because entering the fast path prevents IO combining from happening. So it's
absolutely crucial that we do *not* enter it while we would combine.
But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?
In my experiments the batching was primarily useful to allow to reduce the
command submission & interrupt overhead when interacting with storage, i.e.
when actually doing IO, not just copying from the page cache.
It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache because if
the IO is quick, the same worker may pick up the next IO.
When you say workers, do you mean for io_method=worker? Or the internal kernel
threads of io_uring for async IOs?
I'm talking about io_method worker.
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:
The resowner and private refcount mechanisms take more CPU cycles if you have
more buffers pinned
But I don't see how either distance is responding to this or
self-calibrating with this in mind
Using the minimal required distance to avoid needing to wait for IO completion
is responding to that, no? Without these patches we read ahead as far as
possible, even if all the data is in the page cache, which makes this issue
way worse (without these patches it's a major source of regressions in the
index prefetching patch).
But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try and get IO
combining and to allow for async copying into the kernel buffer cache,
etc, etc. There's a lot of different considerations; it isn't just two
opposing forces. And, I'd imagine that the relationship between the
number of buffers pinned and CPU cycles required for resowner/refcount
isn't perfectly linear.
I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:
We could, but I don't think there would be a benefit in doing so. In my mind,
what combine_distance is used for, is to control how far to look ahead to
allow for IO combining when the readahead_distance is too low to allow for IO
combining. But if pinned_buffers > 0, we already have another IO in flight,
so the combine_distance mechanism doesn't have to kick in.
But it seems like for a completely uncached workload bigger IOs are
still beneficial.
Massively so - the storage layer getting too small IOs really hurts.
But, as the code stands, we *do* end up with large IOs in that case, because
we will not issue the IO until it is "complete". If we need to do actual IO,
the readahead_distance will be larger, and allow multiple full sized IOs to be
issued.

/*
 * We don't start the pending read just because we've hit the distance limit,
 * preferring to give it another chance to grow to full io_combine_limit size
 * once more buffers have been consumed. But this is not desirable in all
 * situations - see below.
 */
static inline bool
read_stream_should_issue_now(ReadStream *stream)
{
	...

	/*
	 * If we've already reached io_combine_limit, there's no chance of growing
	 * the read further.
	 */
	if (pending_read_nblocks >= stream->io_combine_limit)
		return true;

	/* same if capped not by io_combine_limit but combine_distance */
	if (stream->combine_distance > 0 &&
		pending_read_nblocks >= stream->combine_distance)
		return true;

	/*
	 * If we currently have no reads in flight or prepared, issue the IO once
	 * we are not looking ahead further. This ensures there's always at least
	 * one IO prepared.
	 */
	if (stream->pinned_buffers == 0 &&
		!read_stream_should_look_ahead(stream))
		return true;

	return false;
}

So, unless we are not reading ahead, we are not issuing IOs until they are
fully sized (or can't be combined, obviously).
Am I misunderstanding what you're talking about?
I'm not saying that we don't do IO combining at high distances, I'm
more saying that it is confusing that combine_distance controls how
far we look ahead when readahead_distance is low but when
readahead_distance is high, it controls when we issue the IO and not
how far we look ahead. I don't think we should change course now, but
I wanted to call out that this felt a little uncomfortable to me.
- Melanie
Hi,
On 2026-04-03 15:04:51 -0400, Melanie Plageman wrote:
On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <andres@anarazel.de> wrote:
- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)
Because entering the fast path prevents IO combining from happening. So it's
absolutely crucial that we do *not* enter it while we would combine.
But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?
It's true that we can't do combining in the fast path, but the problem is that
with eic=0/1 (or a recent history that leaves us with low distances or a low
pin limit), we will not start the next IO until there are no more buffers
pinned.
Imagine that we started one 16 block IO and have a readahead_distance of
1. After consuming 15 buffers, we will have one more buffer pinned, but
read_stream_look_ahead() will not yet start another IO, due to the
readahead_distance condition (or max_pinned_buffers or ...). Without the
stream->combine_distance == 1 check, the subsequent check for
read_stream_next_buffer() would consider this a valid case for entering
fast-path.
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:
The resowner and private refcount mechanisms take more CPU cycles if you have
more buffers pinned
But I don't see how either distance is responding to this or
self-calibrating with this in mind
Using the minimal required distance to avoid needing to wait for IO completion
is responding to that, no? Without these patches we read ahead as far as
possible, even if all the data is in the page cache, which makes this issue
way worse (without these patches it's a major source of regressions in the
index prefetching patch).
But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try and get IO
combining and to allow for async copying into the kernel buffer cache,
etc, etc.
My testing suggests that doing IO combining for a reasonable io_combine_limit
is pretty much always a win in a steady-state stream (i.e. not a short one
that's not fully consumed), as the gain from avoiding a larger number of
syscalls is sufficiently large.
Once we start doing async copying from the kernel page cache, we will have to
wait for the completion of that async work, which will lead to
readahead_distance being increased if necessary.
There's a lot of different considerations; it isn't just two opposing
forces.
It's not, but I think always performing io_combine_limit sized IOs after a
ramp-up and increasing the distance based on needing to wait is a pretty
decent heuristic.
For best results it does require pgaio_uring_should_use_async() to trigger, as
otherwise we do not get the parallelized memory copy. Which means it may
never trigger if we don't occasionally reach the size based condition. Luckily
it does not seem like using async is beneficial for small IOs.
And, I'd imagine that the relationship between the
number of buffers pinned and CPU cycles required for resowner/refcount
isn't perfectly linear.
It's definitely not.
I'm not saying that we don't do IO combining at high distances, I'm
more saying that it is confusing that combine_distance controls how
far we look ahead when readahead_distance is low but when
readahead_distance is high, it controls when we issue the IO and not
how far we look ahead. I don't think we should change course now, but
I wanted to call out that this felt a little uncomfortable to me.
I'm not sure I see an alternative. I tried to at least improve the comments
around this.
Attached are a revised set of commits. The largest changes are:
- Reordered the series to put
"read_stream: Only increase read-ahead distance when waiting for IO"
after
"stream: Split decision about look ahead for AIO and combining"
Previously I thought it'd be too awkward from a comment perspective, but
there's only one comment where it is a bit odd.
Think it's much clearer this way.
- Largely rewrote "Hacky implementation of making read_stream_reset()/end()
not wait for IO". Looks a lot saner now.
Think this needs a few more tests, in particular for the read stream and
foreign_io paths. Will do that in the next version.
- Tried to address most of Bilal's and Melanie's feedback
- Removed some redundant checks from read_stream_should_issue_now()
- Lots of comment polishing, including revising the top-level read_stream.c
comment
Greetings,
Andres Freund
Attachments:
v5-0001-aio-io_uring-Trigger-async-processing-for-large-I.patch
v5-0002-read_stream-Move-logic-about-IO-combining-issuing.patch
v5-0003-read-stream-Split-decision-about-look-ahead-for-A.patch
v5-0004-read_stream-Only-increase-read-ahead-distance-whe.patch
v5-0005-Allow-read_stream_reset-to-not-wait-for-IO-comple.patch
Hi,
Forgot to send this one earlier:
On 2026-04-03 11:46:59 -0400, Melanie Plageman wrote:
On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
What do you think about the updated patch to achieve that that I posted?
Here is some review on 0005 and 0006, posted earlier.
concrete things:
-------
- I'd reorder stream->distance == 0 in read_stream_look_ahead() (and
issue), I found myself asking why it wasn’t first
It'd not be correct in should_issue_now(). We can't move it before
/* there is no pending IO that could be issued */
if (pending_read_nblocks == 0)
return false;
because then we'd trigger a call to read_stream_start_pending_read(), without
there being a pending read.
We can't move it before
/* never start more IOs than our cap */
if (stream->ios_in_progress >= stream->max_ios)
return false;
because that could lead us to exceeding max_ios.
Greetings,
Andres Freund
Hi,
There are a bunch of heuristics mentioned in the following proposed commit:
On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth is a bit larger. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.
An upcoming commit will make read_stream.c only increase the read-ahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async. Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.
Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.
The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.
The facts from the two prior paragraphs show a way out: use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.
One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case - with deep look-ahead
distances being able to parallelize the memory copy is important even with
smaller IOs.
+/*
+ * io_uring executes IO in process context if possible. That's generally good,
+ * as it reduces context switching. When performing a lot of buffered IO that
+ * means that copying between page cache and userspace memory happens in the
+ * foreground, as it can't be offloaded to DMA hardware as is possible when
+ * using direct IO. When executing a lot of buffered IO this causes io_uring
+ * to be slower than worker mode, as worker mode parallelizes the
+ * copying. io_uring can be told to offload work to worker threads instead.
+ *
+ * If the IOs are small, we only benefit from forcing things into the
+ * background if there is a lot of IO, as otherwise the overhead from context
+ * switching is higher than the gain.
+ *
+ * If IOs are large, there is benefit from asynchronous processing at lower
+ * queue depths, as IO latency is less of a crucial factor and parallelizing
+ * memory copies is more important. In addition, it is important to trigger
+ * asynchronous processing even at low queue depth, as with foreground
+ * processing we might never actually reach deep enough IO depths to trigger
+ * asynchronous processing, which in turn would deprive readahead control
+ * logic of information about whether a deeper look-ahead distance would be
+ * advantageous.
+ *
+ * We have done some basic benchmarking to validate the thresholds used, but
+ * it's quite plausible that there are better values.
Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.
Previously I'd just manually run fio with different options, I made it a bit
more systematic with the attached (only halfway hand written) script.
I attached two different results, once when allowing access to multiple cores,
and once with a single core (simulating a very busy machine).
(nblocks is in multiples of 8KB)
Multi-core:
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2075 1.5802
1 1 1 1.0462 6.9652
1 2 0 4.1362 3.4533
1 2 1 1.9284 7.6040
1 4 0 4.0030 7.3720
1 4 1 4.2713 6.9086
1 8 0 4.1653 14.4072
1 8 1 4.3301 13.8365
1 16 0 4.1829 28.9216
1 16 1 4.3006 28.1261
1 32 0 4.0735 59.6232
1 32 1 4.3248 56.1614
I.e. at nblocks=1, there's pretty much no gain from async, and the latency
increases markedly at the low end and just about catches up at the high end.
Around iodepth 4 the loss from async is nonexistent or minimal.
2 1 0 5.7289 2.4261
2 1 1 1.8708 7.7466
2 2 0 5.7964 5.0144
2 2 1 3.3749 8.7417
2 4 0 5.8434 10.2023
2 4 1 7.9783 7.3977
2 8 0 5.8166 20.7226
2 8 1 8.2545 14.5431
2 16 0 5.8215 41.6613
2 16 1 8.2354 29.3879
2 32 0 5.6530 86.0286
2 32 1 8.3218 58.3826
With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited. Latency already starts to be better at iodepth 4.
4 1 0 7.4131 3.8807
4 1 1 3.2133 9.1827
4 2 0 7.3150 8.0854
4 2 1 5.4983 10.8039
4 4 0 7.2784 16.5097
4 4 1 11.2717 10.5699
4 8 0 7.2873 33.2331
4 8 1 16.6299 14.4164
4 16 0 7.1606 67.8777
4 16 1 16.9794 28.4981
4 32 0 6.2954 154.6834
4 32 1 16.3686 59.3610
With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.
8 1 0 8.0403 7.3503
8 1 1 4.6038 12.7202
8 2 0 8.0052 14.9161
8 2 1 8.5176 13.9987
8 4 0 8.1519 29.6698
8 4 1 14.8211 16.1640
8 8 0 7.8525 61.8612
8 8 1 27.5860 17.4434
8 16 0 6.8887 141.3268
8 16 1 34.1307 28.3463
8 32 0 6.9031 282.2350
8 32 1 38.2430 50.7700
With nblocks=8, async is faster already at iodepth 2.
64 1 0 9.1983 52.6768
64 1 1 8.1505 59.5486
128 1 0 7.5442 128.8704
128 1 1 7.3481 132.2355
Somewhere between nblocks=64 and 128, we reach the point where there's basically no
loss at iodepth 1.
This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.
Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.
At iodepth 1 (worst case, a context switch for every IO):
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2324 1.5692
1 1 1 1.7883 3.9574
2.36x bw regression
2 1 0 5.7914 2.4004
2 1 1 2.9585 4.8417
1.96x bw regression
4 1 0 7.3171 3.9242
4 1 1 4.2450 6.8171
1.7x bw regression
8 1 0 8.1162 7.2674
8 1 1 5.7536 10.2948
1.4x bw regression
16 1 0 8.8559 13.5212
16 1 1 7.1163 16.8277
1.6x bw regression
But the IO depth would not stay at 1 in the case of postgres with the proposed
changes, it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.
Therefore comparing that to a deeper IO depth.
nblocks iod async bw_gib_s lat_usec
1 16 0 4.1094 29.4339
1 16 1 3.3922 35.7044
1.21x bw regression
2 16 0 5.8381 41.5402
2 16 1 4.8104 50.4571
1.21x bw regression
4 16 0 7.1204 68.2612
4 16 1 5.6479 86.0973
1.26x bw regression
8 16 0 7.0780 137.5520
8 16 1 6.1687 157.8805
1.14x bw regression
16 16 0 7.4523 261.4281
16 16 1 6.7192 290.0837
1.10x bw regression
This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.
I don't think there's very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single block IOs, but we
aren't. With the defaults io_combine_limit=16 and effective_io_concurrency=16,
we'd end up with 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.
Seems ok to me.
I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).
Zen 4 Laptop:
nblocks iod async bw_gib_s lat_usec
1 1 0 6.0989 1.1408
1 1 1 1.4477 5.1246
1 2 0 6.9600 2.0827
1 2 1 2.8750 5.1711
1 4 0 7.0283 4.2307
1 4 1 8.9174 3.3169
Surprisingly big difference between sync/async at iod=1, but it's again
similar around iod=4.
4 1 0 14.5638 1.9616
4 1 1 5.1245 5.8016
4 2 0 14.8867 3.9607
4 2 1 12.1841 4.8662
4 4 0 14.8678 8.0764
4 4 1 21.5077 5.5417
Similar.
16 1 0 21.0754 5.5891
16 1 1 12.6180 9.4753
16 2 0 20.2770 11.8353
16 2 1 24.3277 9.8172
At nblocks=16, iod=2 already starts to be faster.
Greetings,
Andres Freund