Don't synchronously wait for already-in-progress IO in read stream

Started by Andres Freund9 months ago33 messageshackers

andres@anarazel.de

9 months ago

Hi,

In the index prefetching thread we discovered that read stream performance
suffers rather substantially when a read stream is reading blocks multiple
times within the readahead distance.

The problem leading to that is that we are currently synchronously waiting for
IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
IO. If a block is read twice, that means we won't actually have enough IOs in
flight to have good performance. What's worse, currently the wait is not
attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
IO).

This does not commonly occur with in-tree users of read streams, as users like
seqscans, bitmap heap scans, vacuum, ... will never try to read the same block
twice. However with index prefetching that is a more common case.

It is possible to encounter a version of this in 18/master: If multiple scans
for the same table are in progress, they can end up waiting synchronously for
each other. However it's a much less severe issue, as the scan that is
"further ahead" will not be blocked.

To fix it, the attached patch has AsyncReadBuffers() check if the "target"
buffer already has IO in progress. If so, it assing the buffer's IO wait
reference to the ReadBuffersOperation. That allows WaitReadBuffers() to wait
for the IO. To make that work correctly, the buffer stats etc have to be
updated in that case in WaitReadBuffers().

I did not feel like I was sufficiently confident in making this work without
tests. However, it's not exactly trivial to test some versions of this, due to
the way multiple processes need to be coordinated. It took way way longer to
write tests than to make the code actually work :/.

The attached tests add a new read_stream_for_blocks() function to test_aio. I
found it also rather useful to reproduce the performance issue without the
index prefetching patch applied. To test the cross process case the injection
point infrastructure in test_aio had to be extended a bit.

Attached are three patches:

0001: Introduces a TestAio package and splits out some existing tests out of
001_aio.pl

0002: Add read_stream test infrastructure & tests

Note that the tests don't test that we don't unnecessarily wait, as
described above, just that IO works correctly.

0003: Improve performance of read stream when re-encountering blocks

To reproduce the issue, the read_stream_for_blocks() function added to
test_aio can be used, in combination with debug_io_direct=data (it's also
possible without DIO, it'd just be more work).

prep:
CREATE EXTENSION test_aio;
CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i);

test:
SELECT pg_buffercache_evict_relation('large');
EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large', ARRAY(SELECT i + generate_series(0, 3) FROM generate_series(1, 10000) g(i)));

Without 0003 applied:
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on read_stream_for_blocks (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1) │
│ Buffers: shared hit=29997 read=10003 │
│ I/O Timings: shared read=16.116 │
│ InitPlan 1 │
│ -> Result (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1) │
│ -> ProjectSet (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1) │
│ -> Function Scan on generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1) │
│ Planning Time: 0.060 ms │
│ Execution Time: 676.436 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Time: 676.753 ms

With 0003:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on read_stream_for_blocks (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1) │
│ Buffers: shared hit=29997 read=10003 │
│ I/O Timings: shared read=50.467 │
│ InitPlan 1 │
│ -> Result (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1) │
│ -> ProjectSet (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1) │
│ -> Function Scan on generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1) │
│ Planning Time: 0.046 ms │
│ Execution Time: 90.661 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Time: 90.887 ms

Greetings,

Andres Freund

Peter Geoghegan

pg@bowt.ie

7 months ago

In reply to: Andres Freund (#1)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Thu, Sep 11, 2025 at 5:46 PM Andres Freund <andres@anarazel.de> wrote:

The problem leading to that is that we are currently synchronously waiting for
IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
IO. If a block is read twice, that means we won't actually have enough IOs in
flight to have good performance. What's worse, currently the wait is not
attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
IO).

This patch no longer cleanly applies. Can you post a new version?

Thanks
--
Peter Geoghegan

Thomas Munro

thomas.munro@gmail.com

7 months ago

In reply to: Andres Freund (#1)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Fri, Sep 12, 2025 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
+     * It's possible that another backend starts IO on the buffer between this
+     * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
+     * we will synchronously wait for the IO below, but the window for that is
+     * small enough that it won't happen often enough to have a significant
+     * performance impact.
+     */
+    if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
...
     /*
      * Check if we can start IO on the first to-be-read buffer.
      *
-     * If an I/O is already in progress in another backend, we want to wait
-     * for the outcome: either done, or something went wrong and we will
-     * retry.
+     * If a synchronous I/O is in progress in another backend (it can't be
+     * this backend), we want to wait for the outcome: either done, or
+     * something went wrong and we will retry.
      */
     if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))

"..., or an asynchronous IO was started after
ReadBuffersIOAlreadyInProgress() (unlikely), ..."?

I suppose (or perhaps vaguely recall from an off-list discussion?)
that you must have considered merging the new
is-it-already-in-progress check into ReadBuffersCanStartIO(). I
suppose the nowait argument would become a tri-state argument with a
value that means "don't wait for an in-progress read, just give me the
IO handle so I can 'join' it as a foreign waiter", with a new output
argument to receive the handle, or something along those lines, and I
guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
name. That'd remove the double-check (extra header lock-unlock cycle)
and associated race that can cause that rare synchronous wait (which
must still happen sometimes in the duelling concurrent scan use
case?), at the slight extra cost of having to allocate and free a
handle in the case of repeated blocks (eg the index->heap scan use
case), but at least that's just backend-local list pushups and doesn't
do extra work otherwise. Is there some logical problem with that
approach? Is the code just too clumsy?

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Thomas Munro (#3)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I suppose (or perhaps vaguely recall from an off-list discussion?)
that you must have considered merging the new
is-it-already-in-progress check into ReadBuffersCanStartIO(). I
suppose the nowait argument would become a tri-state argument with a
value that means "don't wait for an in-progress read, just give me the
IO handle so I can 'join' it as a foreign waiter", with a new output
argument to receive the handle, or something along those lines, and I
guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
name. That'd remove the double-check (extra header lock-unlock cycle)
and associated race that can cause that rare synchronous wait (which
must still happen sometimes in the duelling concurrent scan use
case?), at the slight extra cost of having to allocate and free a
handle in the case of repeated blocks (eg the index->heap scan use
case), but at least that's just backend-local list pushups and doesn't
do extra work otherwise. Is there some logical problem with that
approach? Is the code just too clumsy?

Attached v3 basically does what you suggested above. Now, we should
only have to wait if the backend encounters a buffer after another
backend has set BM_IO_IN_PROGRESS but before that other backend has
set the buffer descriptor's wait reference.

0001 and 0002 are Andres' test-related patches. 0003 is a change I
think is required to make one of the tests stable (esp on the BSDs).
0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
IO concept but with your suggested structure and my suggested styling.
I could potentially break out more into smaller refactoring commits,
but I don't think it's too bad the way it is.

A few things about the patch that I'm not sure about:

- I don't know if pgaio_submit_staged() is in all the right places
(and not in too many places). I basically do it before we would wait
when starting read IO on the buffer. In the permanent buffers case,
that's now only when BM_IO_IN_PROGRESS is set but the wait reference
isn't valid yet. This can't happen in the temporary buffers case, so
I'm not sure we need to call pgaio_submit_staged().

- StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
We could refactor it so that it works for AsyncReadBuffers(), but that
would involve returning something that distinguishes between
IO_IN_PROGRESS and IO already done. And StartBufferIO()'s comment
explicitly says it wants to avoid that.
If we keep my structure, with AsyncReadBuffers() using its own helper
(PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
like we need some way to make it clear what StartBufferIO() is for.
I'm not sure what would collectively describe its current users,
though. It also now has no non-test callers passing nowait as true.
However, once we add write combining, it will, so it seems like we
should leave it the way it is to avoid churn. However, other
developers might be confused in the interim.

- In the 004_read_stream tests, I wonder if there is a way to test
that we don't wait for foreign IO until WaitReadBuffers(). We have
tests for the stream accessing the same block, which in some cases
will exercise the foreign IO path. But it doesn't distinguish between
the old behavior -- waiting for the IO to complete when starting read
IO on it -- and the new behavior -- not waiting until
WaitReadBuffers(). That may not be possible to test, though.

- Melanie

Nazir Bilal Yavuz

byavuz81@gmail.com

4 months ago

In reply to: Melanie Plageman (#4)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

Thank you for working on this!

On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I suppose (or perhaps vaguely recall from an off-list discussion?)
that you must have considered merging the new
is-it-already-in-progress check into ReadBuffersCanStartIO(). I
suppose the nowait argument would become a tri-state argument with a
value that means "don't wait for an in-progress read, just give me the
IO handle so I can 'join' it as a foreign waiter", with a new output
argument to receive the handle, or something along those lines, and I
guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
name. That'd remove the double-check (extra header lock-unlock cycle)
and associated race that can cause that rare synchronous wait (which
must still happen sometimes in the duelling concurrent scan use
case?), at the slight extra cost of having to allocate and free a
handle in the case of repeated blocks (eg the index->heap scan use
case), but at least that's just backend-local list pushups and doesn't
do extra work otherwise. Is there some logical problem with that
approach? Is the code just too clumsy?

Attached v3 basically does what you suggested above. Now, we should
only have to wait if the backend encounters a buffer after another
backend has set BM_IO_IN_PROGRESS but before that other backend has
set the buffer descriptor's wait reference.

0001 and 0002 are Andres' test-related patches. 0003 is a change I
think is required to make one of the tests stable (esp on the BSDs).
0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
IO concept but with your suggested structure and my suggested styling.
I could potentially break out more into smaller refactoring commits,
but I don't think it's too bad the way it is.

I confirm that I am able to produce the regression that Andres
mentioned with the patches excluding 0005, and 0005 fixes the
regression.

A few things about the patch that I'm not sure about:

- I don't know if pgaio_submit_staged() is in all the right places
(and not in too many places). I basically do it before we would wait
when starting read IO on the buffer. In the permanent buffers case,
that's now only when BM_IO_IN_PROGRESS is set but the wait reference
isn't valid yet. This can't happen in the temporary buffers case, so
I'm not sure we need to call pgaio_submit_staged().

I agree with you, I think we don't need to call pgaio_submit_staged()
for the temporary buffers case.

- StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
We could refactor it so that it works for AsyncReadBuffers(), but that
would involve returning something that distinguishes between
IO_IN_PROGRESS and IO already done. And StartBufferIO()'s comment
explicitly says it wants to avoid that.
If we keep my structure, with AsyncReadBuffers() using its own helper
(PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
like we need some way to make it clear what StartBufferIO() is for.
I'm not sure what would collectively describe its current users,
though. It also now has no non-test callers passing nowait as true.
However, once we add write combining, it will, so it seems like we
should leave it the way it is to avoid churn. However, other
developers might be confused in the interim.

I don't have a comment for this.

- In the 004_read_stream tests, I wonder if there is a way to test
that we don't wait for foreign IO until WaitReadBuffers(). We have
tests for the stream accessing the same block, which in some cases
will exercise the foreign IO path. But it doesn't distinguish between
the old behavior -- waiting for the IO to complete when starting read
IO on it -- and the new behavior -- not waiting until
WaitReadBuffers(). That may not be possible to test, though.

Won't 'stream accessing the same block test' almost always test the
new behavior (not waiting until WaitReadBuffers())? Having dedicated
tests for both cases would be helpful, though.

My review:

0001:

0001 LGTM.
---------------

0002:

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+    $node->start();
+    test_io_method($method, $node);
+    $node->stop();
+}

This seems wrong, we always test io_method=worker. I think we need to
replace 'worker' with the $method. Also, we can add check below to the
test_io_method function in the 004_read_stream.pl:
```
is($node->safe_psql('postgres', 'SHOW io_method'),
$io_method, "$io_method: io_method set correctly");
```

Other than that, 0002 LGTM.
---------------

0003:

0003 is a change I
think is required to make one of the tests stable (esp on the BSDs).

0003 LGTM.
---------------

0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
IO concept but with your suggested structure and my suggested styling.
I could potentially break out more into smaller refactoring commits,
but I don't think it's too bad the way it is.

0004:

Nitpick but I prefer something like TrackBufferHit() or
CountBufferHit() as a function name instead of ProcessBufferHit().
ProcessBufferHit() gives the impression that the function is doing a
job more than it currently does. Other than that, 0004 LGTM.
---------------

0005:

0005 LGTM. However, I am still looking into the AIO code. I wanted to
share my review so far.
---------------

--
Regards,
Nazir Bilal Yavuz
Microsoft

Peter Geoghegan

pg@bowt.ie

3 months ago

In reply to: Melanie Plageman (#4)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Fri, Jan 23, 2026 at 4:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Attached v3 basically does what you suggested above. Now, we should
only have to wait if the backend encounters a buffer after another
backend has set BM_IO_IN_PROGRESS but before that other backend has
set the buffer descriptor's wait reference.

Have you considered making ProcessBufferHit into an inline function? I
find that doing so meaningfully improves performance with the index
prefetching patch set. This is particularly true for cached index-only
scans with many VM buffer hits. And it seems to have no downside.

Right now, without any inlining, running perf against a backend that
executes such an index-only scan shows the function/symbol
"ProcessBufferHit.isra.0" as very hot. Apparently gcc does this isra
business ("Interprocedural Scalar Replacement of Aggregates") as an
optimization. Instead of passing the whole struct or pointer, the
caller is rewritten to extract just the necessary scalar values (like
an int or a bool) and pass those directly in registers. But we seem to
be better off fully inlining the function.

--
Peter Geoghegan

Melanie Plageman

melanieplageman@gmail.com

3 months ago

In reply to: Nazir Bilal Yavuz (#5)

Re: Don't synchronously wait for already-in-progress IO in read stream

Thanks for the review!

On Thu, Feb 5, 2026 at 11:56 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
<melanieplageman@gmail.com> wrote:

- In the 004_read_stream tests, I wonder if there is a way to test
that we don't wait for foreign IO until WaitReadBuffers(). We have
tests for the stream accessing the same block, which in some cases
will exercise the foreign IO path. But it doesn't distinguish between
the old behavior -- waiting for the IO to complete when starting read
IO on it -- and the new behavior -- not waiting until
WaitReadBuffers(). That may not be possible to test, though.

Won't 'stream accessing the same block test' almost always test the
new behavior (not waiting until WaitReadBuffers())? Having dedicated
tests for both cases would be helpful, though.

Yea, I was thinking something like testing that if session A is
blocked completing read of block 2 and session B is requesting blocks
2-4 that buffers containing blocks 3 and 4 are valid when session B is
waiting on block 2 to finish.

I started working on something but it needed some new infrastructure
to check if the buffer is valid, and I wanted to see what others
thought first.

I did finally review Andres' test patches and have included my review
feedback here as well.

"aio: Refactor tests in preparation for more tests" (v4-0001) looks
good to me as well. I included one tiny idea AI suggested to me in a
follow-on patch (v4-0002).

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+    $node->start();
+    test_io_method($method, $node);
+    $node->stop();
+}
This seems wrong, we always test io_method=worker. I think we need to
replace 'worker' with the $method. Also, we can add check below to the
test_io_method function in the 004_read_stream.pl:
```
is($node->safe_psql('postgres', 'SHOW io_method'),
$io_method, "$io_method: io_method set correctly");

Good catch. Fixed. I also found a few other small things in this patch
(v4-0003) which I put in v4-0004.

Some ideas I had that I didn't include in v4-0003 because its Andres
patch and is subjective:

For test_repeated_blocks, the first test:

# test miss of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
/);
ok(1, "$io_method: stream missing the same block repeatedly");

It says that it will miss the same block repeatedly, is that because
we won't start a read for any of the blocks until after
read_stream_get_block has returned all of them? If so, could be
clearer in the comment. Not everyone understands all the read stream
internals.

I know a lot of other tests do this, but I find it so hard to read the
test with the SQL is totally left-aligned like that -- especially with
comments interspersed. You can easily flow the queries on multiple
lines and indent it more.

For test_repeated_blocks, the second test:

# test hit of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
5, 6, 5, 4, 3, 2, 1, 0]);
/);
ok(1, "$io_method: stream accessing same block");

I assume that the second access of 2 is a hit because we actually did
IO for the first one (unlike in the earlier case)?

For test_inject_foreign:

In general, I am not ramped up enough on injection point stuff to know
if the actual new injection point stuff you added in test_aio.c is is
correct and optimal, but I did review the read stream additions to
test_aio.c and the tests added to 004_read_stream.pl.

For test_inject_foreign, the 3rd test:

# Test read stream encountering two buffers that are undergoing the same
# IO, started by another backend

I see that psql_b is requesting 3 blocks which can be combined into 1
IO, which makes it different than the 1st foreign IO test case:

###
# Test read stream encountering buffers undergoing IO in another backend,
# with the other backend's reads succeeding.
###

where psql_b only requests 1 but I don't really see how these are
covering different code. Maybe if the read stream one (psql_a) is
doing a combined IO it might exercise slightly different code, but
otherwise I don't get it.

Nitpick but I prefer something like TrackBufferHit() or
CountBufferHit() as a function name instead of ProcessBufferHit().
ProcessBufferHit() gives the impression that the function is doing a
job more than it currently does. Other than that, 0004 LGTM.

I changed this in attached v4.

- Melanie

Melanie Plageman

melanieplageman@gmail.com

3 months ago

In reply to: Peter Geoghegan (#6)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Mon, Feb 23, 2026 at 2:27 PM Peter Geoghegan <pg@bowt.ie> wrote:

Have you considered making ProcessBufferHit into an inline function? I
find that doing so meaningfully improves performance with the index
prefetching patch set. This is particularly true for cached index-only
scans with many VM buffer hits. And it seems to have no downside.

Done in recently posted v4. Thanks for the report!

- Melanie

Melanie Plageman

melanieplageman@gmail.com

3 months ago

In reply to: Melanie Plageman (#7)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Tue, Mar 3, 2026 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Some ideas I had that I didn't include in v4-0003 because its Andres
patch and is subjective:

I was just looking at another patch and realized test_read_stream.c
exists. I wonder if any of the code this patch set adds to test_aio.c
should be there? On the one hand the foreign IO test is testing AIO
behavior and not really read stream behavior even though it invokes
the read stream. So maybe it doesn't really belong in
0004_read_stream.pl? The repeated blocks test is more of a read
stream test. Anyway, just a thought I had.

- Melanie

#10

Nazir Bilal Yavuz

byavuz81@gmail.com

3 months ago

In reply to: Melanie Plageman (#7)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

On Tue, 3 Mar 2026 at 22:47, Melanie Plageman <melanieplageman@gmail.com> wrote:

On Thu, Feb 5, 2026 at 11:56 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I did finally review Andres' test patches and have included my review
feedback here as well.

"aio: Refactor tests in preparation for more tests" (v4-0001) looks
good to me as well. I included one tiny idea AI suggested to me in a
follow-on patch (v4-0002).

This makes sense.

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+    $node->start();
+    test_io_method($method, $node);
+    $node->stop();
+}
This seems wrong, we always test io_method=worker. I think we need to
replace 'worker' with the $method. Also, we can add check below to the
test_io_method function in the 004_read_stream.pl:
```
is($node->safe_psql('postgres', 'SHOW io_method'),
$io_method, "$io_method: io_method set correctly");
Good catch. Fixed. I also found a few other small things in this patch
(v4-0003) which I put in v4-0004.

These look good.

Some ideas I had that I didn't include in v4-0003 because its Andres
patch and is subjective:

For test_repeated_blocks, the first test:

# test miss of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
/);
ok(1, "$io_method: stream missing the same block repeatedly");

It says that it will miss the same block repeatedly, is that because
we won't start a read for any of the blocks until after
read_stream_get_block has returned all of them? If so, could be
clearer in the comment. Not everyone understands all the read stream
internals.

I think we start a read of blocks because we hit stream->distance but
it doesn't affect any consecutive same block numbers. What I
understood is:

Since io_combine_limit is 1, there won't be any IO combining.

0th block (0), miss, distance is 1; StartReadBuffersImpl() and
WaitReadBuffers() are called for 0th block.
1th block (2), miss, distance is 2, StartReadBuffersImpl() is called.
2th block (2), miss, distance is 2, StartReadBuffersImpl() and
WaitReadBuffers() are called 1th block.
3th block (4), miss, distance is 4, StartReadBuffersImpl() is called.
4th block (4), miss, distance is 4, StartReadBuffersImpl() and
WaitReadBuffers() are called 2, 3 and 4th blocks.

I know a lot of other tests do this, but I find it so hard to read the
test with the SQL is totally left-aligned like that -- especially with
comments interspersed. You can easily flow the queries on multiple
lines and indent it more.

I agree with you.

For test_repeated_blocks, the second test:

# test hit of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
5, 6, 5, 4, 3, 2, 1, 0]);
/);
ok(1, "$io_method: stream accessing same block");

I assume that the second access of 2 is a hit because we actually did
IO for the first one (unlike in the earlier case)?

I think so but to clarify, all second access of [2, 1, 0] blocks are hit; right?

For test_inject_foreign:

In general, I am not ramped up enough on injection point stuff to know
if the actual new injection point stuff you added in test_aio.c is is
correct and optimal, but I did review the read stream additions to
test_aio.c and the tests added to 004_read_stream.pl.

For test_inject_foreign, the 3rd test:

# Test read stream encountering two buffers that are undergoing the same
# IO, started by another backend

I see that psql_b is requesting 3 blocks which can be combined into 1
IO, which makes it different than the 1st foreign IO test case:

###
# Test read stream encountering buffers undergoing IO in another backend,
# with the other backend's reads succeeding.
###

where psql_b only requests 1 but I don't really see how these are
covering different code. Maybe if the read stream one (psql_a) is
doing a combined IO it might exercise slightly different code, but
otherwise I don't get it.

I think the main difference is that:

###
# Test read stream encountering buffers undergoing IO in another backend,
# with the other backend's reads succeeding.
###

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 5, 7]);

We need to join waiting block number 5 and then start another IO for
block number 7.

# Test read stream encountering two buffers that are undergoing the same
# IO, started by another backend

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 4]);

We need to join waiting block number 2 but after waiting for an IO, IO
for block number 4 should be already completed too. We don't need to
start IO like the other case.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#11

Melanie Plageman

melanieplageman@gmail.com

3 months ago

In reply to: Nazir Bilal Yavuz (#10)

Re: Don't synchronously wait for already-in-progress IO in read stream

Thanks for the continued review!

Attached v5 adds some comments to the tests, fixes a few nits in the
actual code, and adds a commit to fix what I think is an existing
off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.

On Fri, Mar 6, 2026 at 8:18 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

For test_repeated_blocks, the first test:

# test miss of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
/);
ok(1, "$io_method: stream missing the same block repeatedly");

It says that it will miss the same block repeatedly, is that because
we won't start a read for any of the blocks until after
read_stream_get_block has returned all of them? If so, could be
clearer in the comment. Not everyone understands all the read stream
internals.

I think we start a read of blocks because we hit stream->distance but
it doesn't affect any consecutive same block numbers. What I
understood is:

Since io_combine_limit is 1, there won't be any IO combining.

0th block (0), miss, distance is 1; StartReadBuffersImpl() and
WaitReadBuffers() are called for 0th block.
1th block (2), miss, distance is 2, StartReadBuffersImpl() is called.
2th block (2), miss, distance is 2, StartReadBuffersImpl() and
WaitReadBuffers() are called 1th block.
3th block (4), miss, distance is 4, StartReadBuffersImpl() is called.
4th block (4), miss, distance is 4, StartReadBuffersImpl() and
WaitReadBuffers() are called 2, 3 and 4th blocks.

Makes sense. I've tried to add a comment to this effect.

I know a lot of other tests do this, but I find it so hard to read the
test with the SQL is totally left-aligned like that -- especially with
comments interspersed. You can easily flow the queries on multiple
lines and indent it more.

I agree with you.

I did reflow the SQL. It does mean there will be a bunch of extra
whitespace sent to the server. Other tests do this, though. I wonder
how much it affects performance...

For test_repeated_blocks, the second test:

# test hit of the same block twice in a row
$psql->query_safe(
qq/
SELECT evict_rel('largeish');
/);
$psql->query_safe(
qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
5, 6, 5, 4, 3, 2, 1, 0]);
/);
ok(1, "$io_method: stream accessing same block");

I assume that the second access of 2 is a hit because we actually did
IO for the first one (unlike in the earlier case)?

I think so but to clarify, all second access of [2, 1, 0] blocks are hit; right?

Yes. I tried expanding the comment to elaborate, but it just came out
awkward, so I left it the way it is.

For test_inject_foreign, the 3rd test:

# Test read stream encountering two buffers that are undergoing the same
# IO, started by another backend

I see that psql_b is requesting 3 blocks which can be combined into 1
IO, which makes it different than the 1st foreign IO test case:

###
# Test read stream encountering buffers undergoing IO in another backend,
# with the other backend's reads succeeding.
###

where psql_b only requests 1 but I don't really see how these are
covering different code. Maybe if the read stream one (psql_a) is
doing a combined IO it might exercise slightly different code, but
otherwise I don't get it.

I think the main difference is that:

###
# Test read stream encountering buffers undergoing IO in another backend,
# with the other backend's reads succeeding.
###

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 5, 7]);

We need to join waiting block number 5 and then start another IO for
block number 7.

# Test read stream encountering two buffers that are undergoing the same
# IO, started by another backend

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 4]);

We need to join waiting block number 2 but after waiting for an IO, IO
for block number 4 should be already completed too. We don't need to
start IO like the other case.

Ah, makes sense. Thanks!

- Melanie

#12

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Melanie Plageman (#11)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

Attached v5 adds some comments to the tests, fixes a few nits in the
actual code, and adds a commit to fix what I think is an existing
off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.

Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing

---
src/backend/storage/buffer/bufmgr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..0723d4f3dd8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
* must have started out as a miss in PinBufferForBlock(). The other
* backend will track this as a 'read'.
*/
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
operation->smgr->smgr_rlocator.locator.spcOid,
operation->smgr->smgr_rlocator.locator.dbOid,
operation->smgr->smgr_rlocator.locator.relNumber,
--

Ah, the issue is that we already incremented nblocks_done, right? Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...

Subject: [PATCH v5 4/5] Make buffer hit helper

Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: /messages/by-id/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
---
src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
1 file changed, 56 insertions(+), 55 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0723d4f3dd8..399004c2e44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
bool *foundPtr, IOContext io_context);
static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+
+static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
+													  Relation rel, char persistence, SMgrRelation smgr,
+													  ForkNumber forknum, BlockNumber blocknum);
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
@@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
bool *foundPtr)
{
BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
Assert(blockNum != P_NEW);

@@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
persistence == RELPERSISTENCE_PERMANENT ||
persistence == RELPERSISTENCE_UNLOGGED));

- if (persistence == RELPERSISTENCE_TEMP)
- {
- io_context = IOCONTEXT_NORMAL;
- io_object = IOOBJECT_TEMP_RELATION;
- }
- else
- {
- io_context = IOContextForStrategy(strategy);
- io_object = IOOBJECT_RELATION;
- }
-

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I don't think that necessarily has to conflict with the goal of this patch -
most of the the deduplicated stuff isn't io_context, so the helper will be
beneficial even if have to pull out the io_context/io_object determination to
the callsites.

TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
smgr->smgr_rlocator.backend);
if (persistence == RELPERSISTENCE_TEMP)
-	{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
else
-	{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+							 strategy, foundPtr, IOContextForStrategy(strategy));
+
if (rel)
{
/*

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).

+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static pg_attribute_always_inline void
+CountBufferHit(BufferAccessStrategy strategy,
+			   Relation rel, char persistence, SMgrRelation smgr,
+			   ForkNumber forknum, BlockNumber blocknum)
+{
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+}

I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.

From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a backend attempts to start a read on a buffer and finds that I/O
is already in progress, it previously waited for that I/O to complete
before initiating reads for any other buffers. Although the backend must
still wait for the I/O to finish when later acquiring the buffer, it
should not need to wait at read start time. Other buffers may be
available for I/O, and in some workloads this waiting significantly
reduces concurrency.

For example, index scans may repeatedly request the same heap block. If
the backend waits each time it encounters an in-progress read, the
access pattern effectively degenerates into synchronous I/O. By
introducing the concept of foreign I/O operations, a backend can record
the buffer’s wait reference and defer waiting until WaitReadBuffers()
when it actually acquires the buffer.

In rare cases, a backend may still need to wait when starting a read if
it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
but before the buffer descriptor’s wait reference has been set. Such
windows should be brief and uncommon.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: /messages/by-id/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv

+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBuffer_Status
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBuffer_Status;

I don't personally like mixing underscore and camel case naming within one
name.

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS

/*
* We track various stats related to buffer hits. Because this is done in a
* few separate places, this helper exists for convenience.
@@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* b) reports some time as waiting, even if we never waited
*
* we first check if we already know the IO is complete.
+			 *
+			 * Note that operation->io_return is uninitialized for foreign IO,
+			 * so we cannot count that wait time.
*/

I'm confused - your comment says we can't count wait time with a foreign IO,
but then oes on to count foreign IO time? The lack of io_return just means we
can't do the cheaper pre-check for PGAIO_RS_UNKNOWN, no?

-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
!pgaio_wref_check_done(&operation->io_wref))
{
instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
Assert(pgaio_wref_check_done(&operation->io_wref));
}

-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint32		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					CountBufferHit(operation->strategy,
+								   operation->rel, operation->persistence,
+								   operation->smgr, operation->forknum,
+								   operation->blocknum + operation->nblocks_done - 1);

Probably worth including a comment explaining why we count this as a hit. IIRC
earlier versions had such a comment.

+/*
+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+							Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.

+{
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/* Already valid, no work to do */
+	if (buf_state & BM_VALID)
+	{
+		pgaio_wref_clear(&operation->io_wref);
+		return READ_BUFFER_ALREADY_DONE;
+	}

Is this reachable for local buffers?

+	pgaio_submit_staged();
+
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
+	}
+
+	/*
+	 * While it is possible for a buffer to have been prepared for IO but not
+	 * yet had its wait reference set, there's no way for us to know that for
+	 * temporary buffers. Thus, we'll prepare for own IO on this buffer.
+	 */
+	return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?

+/*
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
+ *
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
+ */
+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+					   Buffer buffer)
+{

I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"? Or ...

+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareNewLocalReadBufferIO(operation, buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
+	{
+		buf_state = LockBufHdr(desc);

Perhaps worth adding an
Assert(buf_state & BM_TAG_VALID)?

+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			pgaio_wref_clear(&operation->io_wref);

Perhaps we should clear &operation->io_wref once at the start? Because right
now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS &&
&desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a
wref set. It seems either we should just touch &operation->io_wref if
BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref)
or we should reliably do it.

+			return READ_BUFFER_ALREADY_DONE;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the IO completes.
+			 */
+			UnlockBufHdr(desc);
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;

Before this commit there was an explanation for this submit:

- /*
- * If this backend currently has staged IO, we need to submit the pending
- * IO before waiting for the right to issue IO, to avoid the potential for
- * deadlocks (and, more commonly, unnecessary delays for other backends).
- */

Seems that vanished?

+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+ */
+static bool
+PrepareAdditionalReadBuffer(Buffer buffer)

I think it'd be good to mention that this may never wait for IO or such to
avoid deadlocks.

+	/* Check if we can start IO on the first to-be-read buffer */
+	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
+		READ_BUFFER_READY_FOR_IO)
+	{

I don't love this < bit. For one there's no mention in
PrepareReadBuffer_Status mentioning that the numerical order is important. Any
reason to not just test != READ_BUFFER_READY_FOR_IO?

The assignment inside the if also looks somewhat awkward. For while() loops
there's often not really a better way to write it, but here you could just as
well do the status assignment in a line before.

+		pgaio_io_release(ioh);
+		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone else has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
+
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			CountBufferHit(operation->strategy,
+						   operation->rel, operation->persistence,
+						   operation->smgr, operation->forknum,
+						   operation->blocknum + operation->nblocks_done - 1);
+			return false;
+		}
+
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		CheckReadBuffersOperation(operation, false);

I was about to suggest that there should be a CheckReadBuffersOperation() for
both returns here, but there already are CheckReadBuffersOperation after calls
to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be
removed.

/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
*/
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
{
-		/*
-		 * Someone else has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
-		*nblocks_progress = 1;
-
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
+		if (!PrepareAdditionalReadBuffer(buffers[i]))
+			break;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);

Seems this assert could stand to be before the PrepareAdditionalReadBuffer(),
as it better hold true before we try to BM_IO_IN_PROGRESS?

I realize this is old, but since you're whacking this around...

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4017896f951..f85a9acc6ac 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,8 @@ struct ReadBuffersOperation
int			flags;
int16		nblocks;
int16		nblocks_done;
+	/* true if waiting on another backend's IO */
+	bool		foreign_io;
PgAioWaitRef io_wref;
PgAioReturn io_return;
};

This adds an alignment-padding hole between nblocks_done and io_wref. Read
stream can allocate quite a few of these, so it's probably worth avoiding?

There's a padding hole between persistence and forknum, but that seems a bit
ugly to utilize. A bit less ugly if we swapped forknum and persistence.

Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and
there are just four flag bits right now).

But perhaps it's also not worth worrying about right now.

[1]: https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure

Greetings,

Andres Freund

#13

Melanie Plageman

melanieplageman@gmail.com

2 months ago

In reply to: Andres Freund (#12)

Re: Don't synchronously wait for already-in-progress IO in read stream

Thanks for the review!

Everything you suggested that I don't elaborate on below, I've just
gone ahead and done in attached v6.

On Tue, Mar 17, 2026 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
* must have started out as a miss in PinBufferForBlock(). The other
* backend will track this as a 'read'.
*/
-             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
+             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
operation->smgr->smgr_rlocator.locator.spcOid,
operation->smgr->smgr_rlocator.locator.dbOid,
operation->smgr->smgr_rlocator.locator.relNumber,
--

This probably needs to be backpatched...

0003 in v6 does as you suggest. I'll backport it after a quick +1 here.

Subject: [PATCH v5 4/5] Make buffer hit helper

@@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
persistence == RELPERSISTENCE_PERMANENT ||
persistence == RELPERSISTENCE_UNLOGGED));

- if (persistence == RELPERSISTENCE_TEMP)
- {
- io_context = IOCONTEXT_NORMAL;
- io_object = IOOBJECT_TEMP_RELATION;
- }
- else
- {
- io_context = IOContextForStrategy(strategy);
- io_object = IOOBJECT_RELATION;
- }
-

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I added a separate patch to refactor the code to do this first (0004).

@@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
smgr->smgr_rlocator.backend);

if (persistence == RELPERSISTENCE_TEMP)

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

And you think it is optimizing it away in PinBufferForBlock()?

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).

I did this. I did not check the compiled code before or after though.

+CountBufferHit(BufferAccessStrategy strategy,
+                        Relation rel, char persistence, SMgrRelation smgr,
+                        ForkNumber forknum, BlockNumber blocknum)
I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.

At some point, I had ProcessBufferHit(), but Bilal felt it implied the
function did more than counting. I've changed it now to
TrackBufferHit().

From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v5 5/5] Don't wait for already in-progress IO

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS

I've done something like this in v6.

+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+                                                     Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.

I did this. However, I was a bit lazy in how many cases I added
because I used invalidate_rel_block(), which is pretty verbose (since
evict_rel() doesn't work yet for local buffers).

I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
(though perhaps we aren't testing it for shared buffers either?).

+{
+     BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+     uint64          buf_state = pg_atomic_read_u64(&desc->state);
+
+     /* Already valid, no work to do */
+     if (buf_state & BM_VALID)
+     {
+             pgaio_wref_clear(&operation->io_wref);
+             return READ_BUFFER_ALREADY_DONE;
+     }

Is this reachable for local buffers?

Yes, I think this is reachable by local buffers that started the IO
already and then completed it when acquiring a new IO handle at the
top of AsyncReadBuffers().

+     if (pgaio_wref_valid(&desc->io_wref))
+     {
+             operation->io_wref = desc->io_wref;
+             operation->foreign_io = true;
+             return READ_BUFFER_IN_PROGRESS;
+     }
+
+     /*
+      * While it is possible for a buffer to have been prepared for IO but not
+      * yet had its wait reference set, there's no way for us to know that for
+      * temporary buffers. Thus, we'll prepare for own IO on this buffer.
+      */
+     return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?

You're right, that's not possible for local buffers. For local
buffers, we "prepare for IO" by calling PrepareNewLocalReadBufferIO()
and then set the wait ref in a codepath initiated by calling
smgrstartreadv() as part of "staging" the IO. No one can observe that
buffer in between the call to PrepareNewLocalReadBufferIO() and
setting the wait reference. So, I've deleted the comment.

+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+                                        Buffer buffer)
+{
I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"? Or ...

I changed the names to PrepareHeadBufferReadIO() and
PrepareAdditionalBufferReadIO(). "Head" instead of "First" because
First felt like it implied the first buffer ever and head seems to
make it clear it is the first buffer of this new IO.

- Melanie

#14

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Melanie Plageman (#13)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

On 2026-03-18 12:59:11 -0400, Melanie Plageman wrote:

On Tue, Mar 17, 2026 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
* must have started out as a miss in PinBufferForBlock(). The other
* backend will track this as a 'read'.
*/
-             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
+             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
operation->smgr->smgr_rlocator.locator.spcOid,
operation->smgr->smgr_rlocator.locator.dbOid,
operation->smgr->smgr_rlocator.locator.relNumber,
--
Ah, the issue is that we already incremented nblocks_done, right? Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...
0003 in v6 does as you suggest. I'll backport it after a quick +1 here.

LGTM.

@@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
smgr->smgr_rlocator.backend);

if (persistence == RELPERSISTENCE_TEMP)

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

And you think it is optimizing it away in PinBufferForBlock()?

It doesn't really have a choice :) - the
pgBufferUsage.(local|shared)_blks_hit++
is within the already required
if (persistence == RELPERSISTENCE_TEMP)

so there's not really a branch to optimize away.

But maybe I misunderstood?

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).

I did this. I did not check the compiled code before or after though.

I checked after and it looks good (well, ok enough, but that's unrelated to
your changes).

Just to verify that any of this actually matters, I did some benchmarking with
the call to TrackBufferHit() removed, and with pg_prewarm() of a scale 100
pgbench_accounts I do see about ~3% of a gain from that. I did also verify
that with the patch we're ever so slightly, but reproducibly, faster than
master. There's future optimization potential for sure though.

+CountBufferHit(BufferAccessStrategy strategy,
+                        Relation rel, char persistence, SMgrRelation smgr,
+                        ForkNumber forknum, BlockNumber blocknum)
I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.
At some point, I had ProcessBufferHit(), but Bilal felt it implied the
function did more than counting. I've changed it now to
TrackBufferHit().

WFM.

+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+                                                     Buffer buffer)
Hm, seems the test in 0002 should be extended to cover the the temp table case.
I did this. However, I was a bit lazy in how many cases I added
because I used invalidate_rel_block(), which is pretty verbose (since
evict_rel() doesn't work yet for local buffers).

Ah, yea, that's annoying. I think some basic coverage is good enough for now.

I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
(though perhaps we aren't testing it for shared buffers either?).

We do reach the READ_BUFFER_ALREADY_DONE in PrepareHeadBufferReadIO(), but
only due to io_method=sync peculiarities (as that only actually performs the
IO later when waiting, it's easy to have two IOs for the same block).

It's probably worth adding tests for that, although I suspect it should be in
001_aio.pl - no read stream required to hit it. I can give it a shot, if you
want?

+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+                                        Buffer buffer)
+{
I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"? Or ...
I changed the names to PrepareHeadBufferReadIO() and
PrepareAdditionalBufferReadIO(). "Head" instead of "First" because
First felt like it implied the first buffer ever and head seems to
make it clear it is the first buffer of this new IO.

Head works!

Subject: [PATCH v6 4/8] Pass io_object and io_context through to
PinBufferForBlock()

The duplication due to handling the RBM_ZERO_AND_CLEANUP_LOCK case is a bit
annoying, but I think it's still an improvement.

Subject: [PATCH v6 5/8] Make buffer hit helper

LGTM.

From b73b896febc35253ca2607cb0fe143355b91256f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 18 Mar 2026 11:09:58 -0400
Subject: [PATCH v6 6/8] Restructure AsyncReadBuffers()

Restructure AsyncReadBuffers() to use early return when the head buffer
is already valid, instead of using a did_start_io flag and if/else
branches. Also move around a bit of the code to be located closer to
where it is used. This is a refactor only.

I think there should be as little work as posbile between setting
IO_IN_PROGRESS and starting the IO. Most of the work you deferred is cheap
enough that it shouldn't matter, but pgstat_prepare_report_checksum_failure()
might need to do a bit more (including taking lwlocks and stuff).

I'm also a bit doubtful that deferring the flag determinations is a good idea,
mostly because it adds a bunch of stuff between starting IO on the head and
subsequent buffers. Not that it's expensive, but it seems to make it more
likely that somebody would end up putting other code inbetween the head and
additional buffer IO starts. And it's cheap enough that it doesn't matter to
waste it if we return early.

+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * This function must never wait for IO to avoid deadlocks. The head buffer
+ * already has BM_IO_IN_PROGRESS set, so we'll just issue that IO and come
+ * back in lieu of waiting here.

"come back" is a bit odd, since you'd not actually come back to
PrepareAdditionalBufferReadIO().

Looks good otherwise.

From 63cb731176a62320d296f968b12a5d4d36e703d0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 18 Mar 2026 11:17:57 -0400
Subject: [PATCH v6 8/8] AIO: Don't wait for already in-progress IO

When a backend attempts to start a read IO and finds the first buffer
already has I/O in progress, previously it waited for that I/O to
complete before initiating reads for any of the subsequent buffers.

Although the backend must wait for the I/O to finish when acquiring the
buffer, there's no reason for it to wait when setting up the read
operation. Waiting at this point prevents the backend from starting I/O
on subsequent buffers and can significantly reduce concurrency.

This matters in two workloads: when multiple backends scan the same
relation concurrently, and when a single backend requests the same block
multiple times within the readahead distance.

If backends wait each time they encounter an in-progress read,
the access pattern effectively degenerates into synchronous I/O.

To fix this, when encountering an already in-progress IO for the head
buffer, a backend now records the buffer's wait reference and defers
waiting until WaitReadBuffers(), when it actually needs to acquire the
buffer.

In rare cases, a backend may still need to wait synchronously at IO
start time: if another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.

@@ -1663,8 +1677,9 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
* Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to
* avoid an external function call.
*/
-static bool
-PrepareHeadLocalBufferReadIO(Buffer buffer)
+static PrepareReadBufferStatus
+PrepareHeadLocalBufferReadIO(ReadBuffersOperation *operation,
+							 Buffer buffer)
{
BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
uint64		buf_state = pg_atomic_read_u64(&desc->state);
@@ -1675,49 +1690,60 @@ PrepareHeadLocalBufferReadIO(Buffer buffer)
* handle). Only the owning backend can set BM_VALID on a local buffer.
*/
if (buf_state & BM_VALID)
-		return false;
+		return READ_BUFFER_ALREADY_DONE;

/*
* Submit any staged IO before checking for in-progress IO. Without this,
* the wref check below could find IO that this backend staged but hasn't
-	 * submitted yet. Waiting on that would PANIC because the owner can't wait
-	 * on its own staged IO.
+	 * submitted yet. If we returned READ_BUFFER_IN_PROGRESS and
+	 * WaitReadBuffers() then tried to wait on it, we'd PANIC because the
+	 * owner can't wait on its own staged IO.
*/
pgaio_submit_staged();

-	/* Wait for in-progress IO */
+	/* We've already asynchronously started this IO, so join it */
if (pgaio_wref_valid(&desc->io_wref))
{
-		PgAioWaitRef iow = desc->io_wref;
-
-		pgaio_wref_wait(&iow);
-
-		buf_state = pg_atomic_read_u64(&desc->state);
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
}

-	/*
-	 * If BM_VALID is set, we waited on IO and it completed successfully.
-	 * Otherwise, we'll initiate IO on the buffer.
-	 */
-	return !(buf_state & BM_VALID);
+	/* Prepare to start IO on this buffer */
+	return READ_BUFFER_READY_FOR_IO;
}

Hm. Is buf_state & BM_VALID actually not reachable anymore? What if the
pgaio_submit_staged() completed the IO? In that case there won't be a wref and
we'll get here with buf_state & BM_VALID.

@@ -1732,11 +1758,25 @@ PrepareHeadBufferReadIO(Buffer buffer)
if (buf_state & BM_VALID)
{
UnlockBufHdr(desc);
- return false;
+ return READ_BUFFER_ALREADY_DONE;
}
if (buf_state & BM_IO_IN_PROGRESS)
{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}

Out of a strict sense of rule-following, I'd do the operation->foreign_io
after the UnlockBufHdr(), since it doesn't actually need to be in the locked
section.

@@ -1959,11 +2002,45 @@ WaitReadBuffers(ReadBuffersOperation *operation)
Assert(pgaio_wref_check_done(&operation->io_wref));
}

-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					BlockNumber blocknum = operation->blocknum + operation->nblocks_done;

Maybe we should assert that the buffer's block equals what we expect?

Greetings,

Andres Freund

#15

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Andres Freund (#14)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

On 2026-03-18 16:16:14 -0400, Andres Freund wrote:

I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
(though perhaps we aren't testing it for shared buffers either?).

We do reach the READ_BUFFER_ALREADY_DONE in PrepareHeadBufferReadIO(), but
only due to io_method=sync peculiarities (as that only actually performs the
IO later when waiting, it's easy to have two IOs for the same block).

It's probably worth adding tests for that, although I suspect it should be in
001_aio.pl - no read stream required to hit it. I can give it a shot, if you
want?

I started writing some tests and realized that these commits had made that
somewhat harder - by not using StartBufferIO() anymore, the existing test
infrastructure for StartBufferIO() (in test_aio) became less useful. In
effect it actually reduced coverage some.

Thinking about it more, I also got worried about the duplicating of
logic. It's perhaps acceptable with the patches as-is, but we'll soon need
something very similar for AIO writes. Then we'd end up with like 5 variants,
because we'd still need the existing StartBufferIO() for some cases where we
do want to wait (e.g. the edge case in ExtendBufferedRelShared()).

In the attached prototype I replaced your patch introducing
PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
instead revises StartBufferIO() to have the enum return value you introduced
and a PgAioWaitRef * argument that callers that would like to asynchronously
wait for the IO to complete (others pass NULL). There are some other cleanups
in it too, see the commit message for more details.

Your final patch doesn't change a whole lot due to this, instead of calling
PrepareHeadBufferReadIO() it now calls StartBufferIO(wait=true,
&operation->io_wref) and instead of PrepareAdditionalBufferReadIO() it does
StartBufferIO(wait=false, NULL). That means it has to set
operation->foreign_io to true itself, but that seems not problematic.

As part of this I replaced the following comment:
+    /*
+     * Submit any staged IO before checking for in-progress IO. Without this,
+     * the wref check below could find IO that this backend staged but hasn't
+     * submitted yet. Waiting on that would PANIC because the owner can't wait
+     * on its own staged IO.
+     */
+    pgaio_submit_staged();

as I am pretty sure that can't be reached. I added an assert + explanation.

I also updated "Restructure AsyncReadBuffers()" to move
pgstat_prepare_report_checksum_failure() and the computation of flags to
before the ReadBuffersCanStartIO(). And added a comment explaining why little
should be added between the ReadBuffersCanStartIO() calls.

Thoughts?

I still want to expand the tests a bit, but I thought that we should resolve
this structural issue first.

Greetings,

Andres Freund

#16

Melanie Plageman

melanieplageman@gmail.com

2 months ago

In reply to: Andres Freund (#15)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Thu, Mar 19, 2026 at 6:22 PM Andres Freund <andres@anarazel.de> wrote:

Thinking about it more, I also got worried about the duplicating of
logic. It's perhaps acceptable with the patches as-is, but we'll soon need
something very similar for AIO writes. Then we'd end up with like 5 variants,
because we'd still need the existing StartBufferIO() for some cases where we
do want to wait (e.g. the edge case in ExtendBufferedRelShared()).

In the attached prototype I replaced your patch introducing
PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
instead revises StartBufferIO() to have the enum return value you introduced
and a PgAioWaitRef * argument that callers that would like to asynchronously
wait for the IO to complete (others pass NULL). There are some other cleanups
in it too, see the commit message for more details.

I've come around to this. The aspect I like least is that io_wref is
used both as an output parameter _and_ a decision input parameter.

Some callers never want to wait on in-progress IO (and don't have to),
while others must eventually wait but can defer that waiting as long
as they have a wait reference. If they can't get a wait reference,
they have no way to wait later, so they must wait now. The presence of
io_wref indicates this difference.

I think it's important to express that less mechanically than in your
current header comment and comment in the else block of
StartSharedBufferIO() where we do the waiting. Explaining first—before
detailing argument combinations—why a caller would want to pass
io_wref might help.

However, I do think you need to enumerate the different combinations
of wait and io_wref (as you've done) to make it clear what they are.

I, for example, find it very confusing what wait == false and io_wref
not NULL would mean. If IO is in progress on the buffer and the
io_wref is not valid yet, the caller would get the expected
BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
see callers easily misinterpreting the API and passing this
combination when what they want is wait == true and io_wref not NULL
-- because they don't want to synchronously wait.

I don't have any good suggestions despite thinking about it, though.

Two other things about 0007:

for (int i = nblocks_done + 1; i < operation->nblocks; i++)
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i - 1]) ==
BufferGetBlockNumber(buffers[i]) - 1);

status = StartBufferIO(buffers[nblocks_done], true, false, NULL);

Copy-paste bug above, should be StartBufferIO(buffers[i],...

I would mention that currently BUFFER_IO_IN_PROGRESS is not used in
the first StartBufferIO() case, so that is dead code as of this commit

I also updated "Restructure AsyncReadBuffers()" to move
pgstat_prepare_report_checksum_failure() and the computation of flags to
before the ReadBuffersCanStartIO(). And added a comment explaining why little
should be added between the ReadBuffersCanStartIO() calls.

Thoughts?

Yea, definitely think the comment is important.

- Melanie

#17

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Melanie Plageman (#16)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

On 2026-03-20 15:50:59 -0400, Melanie Plageman wrote:

On Thu, Mar 19, 2026 at 6:22 PM Andres Freund <andres@anarazel.de> wrote:

Thinking about it more, I also got worried about the duplicating of
logic. It's perhaps acceptable with the patches as-is, but we'll soon need
something very similar for AIO writes. Then we'd end up with like 5 variants,
because we'd still need the existing StartBufferIO() for some cases where we
do want to wait (e.g. the edge case in ExtendBufferedRelShared()).

In the attached prototype I replaced your patch introducing
PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
instead revises StartBufferIO() to have the enum return value you introduced
and a PgAioWaitRef * argument that callers that would like to asynchronously
wait for the IO to complete (others pass NULL). There are some other cleanups
in it too, see the commit message for more details.

I've come around to this. The aspect I like least is that io_wref is
used both as an output parameter _and_ a decision input parameter.

Do you have an alternative suggestion? We could add a dedicated parameter for
that, but then it just opens up different ways of calling the function with
the wrong arguments.

Some callers never want to wait on in-progress IO (and don't have to),
while others must eventually wait but can defer that waiting as long
as they have a wait reference. If they can't get a wait reference,
they have no way to wait later, so they must wait now. The presence of
io_wref indicates this difference.

I think it's important to express that less mechanically than in your
current header comment and comment in the else block of
StartSharedBufferIO() where we do the waiting. Explaining first—before
detailing argument combinations—why a caller would want to pass
io_wref might help.

I'm not entirely sure what you'd like here. Would the following comment
do the trick?

* In several scenarios the buffer may already be undergoing I/O in this or
* another backend. How to best handle that depends on the caller's
* situation. It might be appropriate to wait synchronously (e.g., because the
* buffer is about to be invalidated); wait asynchronously, using the buffer's
* IO wait reference (e.g., because the caller is doing readahead and doesn't
* need the buffer to be ready immediately); or to not wait at all (e.g.,
* because the caller is trying to combine IO for this buffer with another
* buffer).
*
* How and whether to wait is controlled by the wait in io_wref parameters. In
* detail:
*
* <existing comment>

However, I do think you need to enumerate the different combinations
of wait and io_wref (as you've done) to make it clear what they are.

I, for example, find it very confusing what wait == false and io_wref
not NULL would mean. If IO is in progress on the buffer and the
io_wref is not valid yet, the caller would get the expected
BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
see callers easily misinterpreting the API and passing this
combination when what they want is wait == true and io_wref not NULL
-- because they don't want to synchronously wait.

Hm. I started out proposing that we should just add an assert documenting this
is a nonsensical combination. But when writing the comment for that I realized
that it theoretically could make sense to pass in wait == false and io_wref !=
NULL, if you wanted to get a wait reference, but would not want to do a
WaitIO() if there's no wait reference set.

I don't think that's something we need right now, but ...

I don't have any good suggestions despite thinking about it, though.

Two other things about 0007:

for (int i = nblocks_done + 1; i < operation->nblocks; i++)
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i - 1]) ==
BufferGetBlockNumber(buffers[i]) - 1);

status = StartBufferIO(buffers[nblocks_done], true, false, NULL);

Copy-paste bug above, should be StartBufferIO(buffers[i],...

I would mention that currently BUFFER_IO_IN_PROGRESS is not used in
the first StartBufferIO() case, so that is dead code as of this commit

Whaaat. Why did this even pass tests??? I guess it just always failed to
start IO because there already was IO on the buffer and that was good enough
to get through.

Clearly a testing gap.

Not entirely trivial to test though :(.

Greetings,

Andres Freund

#18

Melanie Plageman

melanieplageman@gmail.com

2 months ago

In reply to: Andres Freund (#17)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Wed, Mar 25, 2026 at 11:15 AM Andres Freund <andres@anarazel.de> wrote:

In the attached prototype I replaced your patch introducing
PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
instead revises StartBufferIO() to have the enum return value you introduced
and a PgAioWaitRef * argument that callers that would like to asynchronously
wait for the IO to complete (others pass NULL). There are some other cleanups
in it too, see the commit message for more details.

I've come around to this. The aspect I like least is that io_wref is
used both as an output parameter _and_ a decision input parameter.

Do you have an alternative suggestion? We could add a dedicated parameter for
that, but then it just opens up different ways of calling the function with
the wrong arguments.

Yea, I don't know. The cleanest way to handle callers with different
intentions and tolerances is to have different functions that allow
different behavior. But that's the opposite of what we're currently
trying to do :) I tried to come up with some intention-related enum
input argument, but that seems like a bit of over-engineering.

Some callers never want to wait on in-progress IO (and don't have to),
while others must eventually wait but can defer that waiting as long
as they have a wait reference. If they can't get a wait reference,
they have no way to wait later, so they must wait now. The presence of
io_wref indicates this difference.

I think it's important to express that less mechanically than in your
current header comment and comment in the else block of
StartSharedBufferIO() where we do the waiting. Explaining first—before
detailing argument combinations—why a caller would want to pass
io_wref might help.

I'm not entirely sure what you'd like here. Would the following comment
do the trick?

* In several scenarios the buffer may already be undergoing I/O in this or
* another backend. How to best handle that depends on the caller's
* situation. It might be appropriate to wait synchronously (e.g., because the
* buffer is about to be invalidated); wait asynchronously, using the buffer's
* IO wait reference (e.g., because the caller is doing readahead and doesn't
* need the buffer to be ready immediately); or to not wait at all (e.g.,
* because the caller is trying to combine IO for this buffer with another
* buffer).
*
* How and whether to wait is controlled by the wait in io_wref parameters. In
* detail:
*
* <existing comment>

Sounds good to me.

However, I do think you need to enumerate the different combinations
of wait and io_wref (as you've done) to make it clear what they are.

I, for example, find it very confusing what wait == false and io_wref
not NULL would mean. If IO is in progress on the buffer and the
io_wref is not valid yet, the caller would get the expected
BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
see callers easily misinterpreting the API and passing this
combination when what they want is wait == true and io_wref not NULL
-- because they don't want to synchronously wait.

Hm. I started out proposing that we should just add an assert documenting this
is a nonsensical combination. But when writing the comment for that I realized
that it theoretically could make sense to pass in wait == false and io_wref !=
NULL, if you wanted to get a wait reference, but would not want to do a
WaitIO() if there's no wait reference set.

I don't think that's something we need right now, but ...

We should somehow express this in the comment.

- Melanie

#19

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Melanie Plageman (#18)

Re: Don't synchronously wait for already-in-progress IO in read stream

Hi,

Attached is an updated set of patches. I fixed the bug that Melanie noticed,
updated the comments a bit further and added a new commit that adds a decent
bit of coverage of StartReadBuffers(), including all the cases in
StartSharedBufferIO and most of the cases in StartLocalBufferIO().

I'm planning to commit 0001 soon - it hasn't changed in a while. Then I'd like
to get 0002 committed soon after, but I'll hold off for that until tomorrow,
given that nobody has looked at it (as it's new). I think 0004-0007 can be
committed too, but I am not sure if you (Melanie) want to do so.

I'd like to get the rest committed tomorrow too.

Greetings,

Andres Freund

#20

Melanie Plageman

melanieplageman@gmail.com

2 months ago

In reply to: Andres Freund (#19)

Re: Don't synchronously wait for already-in-progress IO in read stream

On Wed, Mar 25, 2026 at 5:58 PM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon - it hasn't changed in a while. Then I'd like
to get 0002 committed soon after, but I'll hold off for that until tomorrow,
given that nobody has looked at it (as it's new). I think 0004-0007 can be
committed too, but I am not sure if you (Melanie) want to do so.

This is a review of 0002

in read_buffers():
- you forgot to initialize operation->forknum
- I think the output columns could really use a bit of documentation
- I assume that using uint16 for nblocks and nios is to make sure it
plays nice with the input nblocks which is an int32 -- it doesn't
matter for the range of values you're using, but it would be better if
you could just use uint32 everywhere but I guess because we only have
int4 in SQL you can't.
anyway, I find all the many different types of ints and uints in
read_buffers() pretty sus. Like why do you need this cast to int16?
that seems...wrong and unnecessary?
values[3] = Int32GetDatum((int16) nblocks_this_io);

in the evict_rel() refactor:
- you need invalidate_one_block() to use the forknum parameter because
otherwise temp tables won't evict any forks except the main fork

for the tests themselves:
- for the first test,
# check that one larger read is done as multiple reads
isn't the comment actually the opposite of what it is testing?

0|0|t|2 -- would be 1 2 block io starting at 0, no?

seems like something like
# check that consecutive misses are combined into one read
would be better

- for this comment:
# but if we do it again, i.e. it's in s_b, there will be two operations
technically you are also doing this for temp tables, so the comment
isn't entirely correct.

- For this test:
# Verify that we aren't doing reads larger than io_combine_limit
isn't this more just covering the logic in read_buffers()? AFAICT
StartReadBuffers() only worries about the max IOs it can combine if it
is near the segment boundary

- For this:
$psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 1)|);
$psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 2)|);
$psql_a->query_safe(qq|SELECT * FROM read_buffers('$table', 3, 2)|);
psql_like(
$io_method,
$psql_a,
"$persistency: read buffers, miss 1-2, hit 3-4",
qq|SELECT blockoff, blocknum, needs_io, nblocks FROM
read_buffers('$table', 1, 4)|,
qr/^0\|1\|t\|2\n2\|3\|f\|1\n3\|4\|f\|1$/,
qr/^$/);

I think this is a duplicate. There is one before and after the "verify
we aren't doing reads larger than io_combine_limit"

- It may be worth adding one more test case which is IO in progress on
the last block since you have in-progress as the first and the middle
blocks but not as the last block

# Test in-progress IO on the last block of the range
$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
$psql_a->query_safe(
qq|SELECT read_rel_block_ll('$table', 3, wait_complete=>false)|);
psql_like(
$io_method,
$psql_a,
"$persistency: read buffers, in-progress 3, read 1-3",
qq|SELECT blockoff, blocknum, needs_io, nblocks FROM
read_buffers('$table', 1, 3)|,
qr/^0\|1\|t\|2\n2\|3\|f\|1$/,
qr/^$/);

- Melanie

#21

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Melanie Plageman (#20)

#22

Andres Freund

andres@anarazel.de

2 months ago

In reply to: Andres Freund (#21)

#23