Streamify more code paths
Hi Hackers,
I noticed several additional paths in contrib modules, beyond [1],
that are potentially suitable for streamification:
1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
The following patches streamify those code paths. No benchmarks have
been run yet.
[1]: /messages/by-id/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ@mail.gmail.com
Feedback welcome.
--
Best,
Xuneng
Attachments:
v1-0003-Streamify-heap-bloat-estimation-scan.patch (+94/-32)
v1-0002-Streamify-Bloom-VACUUM-paths.patch (+51/-5)
v1-0001-Switch-Bloom-scan-paths-to-streaming-read.patch (+25/-5)
Hi,
On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
One more in ginvacuumcleanup().
--
Best,
Xuneng
Attachments:
v1-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch (+26/-3)
v1-0001-Switch-Bloom-scan-paths-to-the-streaming-read.patch (+25/-5)
v1-0002-Streamify-Bloom-VACUUM-paths.patch (+51/-5)
v1-0003-Streamify-heap-bloat-estimation-scan.patch (+94/-32)
Hi,
Thank you for working on this!
On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
0001, 0002 and 0004 LGTM.
0003:
+ buf = read_stream_next_buffer(stream, NULL);
+ if (buf == InvalidBuffer)
+ break;
I think we are loosening the check here. We were sure that there were
no InvalidBuffers until nblocks. The streamified version does not have
this check; it exits the loop the first time it sees an
InvalidBuffer, which may be wrong. You might want to add
'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
have a similar check.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi Bilal,
Thanks for your review!
On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Agree. The check has been added in v2 per your suggestion.
--
Best,
Xuneng
Attachments:
v2-0001-Switch-Bloom-scan-paths-to-streaming-read.patch (+25/-5)
v2-0003-Streamify-heap-bloat-estimation-scan-Introduc.patch (+95/-32)
v2-0002-Streamify-Bloom-VACUUM-paths-Use-streaming-re.patch (+51/-5)
v2-0004-Replace-synchronous-ReadBufferExtended-loop-with.patch (+26/-3)
Hi,
On Sat, Dec 27, 2025 at 12:41 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Two more to go:
patch 5: Streamify log_newpage_range() WAL logging path
patch 6: Streamify hash index VACUUM primary bucket page reads
Benchmarks will be conducted soon.
--
Best,
Xuneng
Attachments:
v2-0002-Streamify-Bloom-VACUUM-paths-Use-streaming-re.patch (+51/-5)
v2-0004-Replace-synchronous-ReadBufferExtended-loop-with.patch (+26/-3)
v2-0001-Switch-Bloom-scan-paths-to-streaming-read.patch (+25/-5)
v2-0003-Streamify-heap-bloat-estimation-scan-Introduc.patch (+95/-32)
v2-0005-Streamify-log_newpage_range-WAL-logging-path.patch (+26/-3)
v2-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch (+76/-2)
Hi,
On Sun, Dec 28, 2025 at 7:41 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Patch 0006 in the last message has a problem and had not been updated.
Attaching the right one again. Sorry for the noise.
--
Best,
Xuneng
Attachments:
v2-0002-Streamify-Bloom-VACUUM-paths-Use-streaming-re.patch (+51/-5)
v2-0001-Switch-Bloom-scan-paths-to-streaming-read.patch (+25/-5)
v2-0004-Replace-synchronous-ReadBufferExtended-loop-with.patch (+26/-3)
v2-0003-Streamify-heap-bloat-estimation-scan-Introduc.patch (+95/-32)
v2-0005-Streamify-log_newpage_range-WAL-logging-path.patch (+26/-3)
v2-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch (+78/-2)
Hi,
On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
0003 and 0006:
You need to add 'StatApproxReadStreamPrivate' and
'HashBulkDeleteStreamPrivate' to the typedefs.list.
0005:
@@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
nbufs = 0;
while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
{
- Buffer buf = ReadBufferExtended(rel, forknum, blkno,
- RBM_NORMAL, NULL);
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+
+ if (!BufferIsValid(buf))
+ break;
We are loosening a check here; there should not be an invalid buffer in
the stream until endblk. I think you can remove this
BufferIsValid() check, then we can learn if something goes wrong.
0006:
You can use read_stream_reset() instead of read_stream_end(), then you
can use the same stream with different variables; I believe this is
the preferred way.
Rest LGTM!
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
Thanks for looking into this.
On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
0003 and 0006:
You need to add 'StatApproxReadStreamPrivate' and
'HashBulkDeleteStreamPrivate' to the typedefs.list.
Done.
0005:
We are loosening a check here; there should not be an invalid buffer in
the stream until endblk. I think you can remove this BufferIsValid()
check, then we can learn if something goes wrong.
My earlier reason for not adding an assert at the end of the stream was
the potential early break here:
/* Nothing more to do if all remaining blocks were empty. */
if (nbufs == 0)
break;
After looking more closely, it turns out to be a misunderstanding of the logic.
0006:
You can use read_stream_reset() instead of read_stream_end(), then you
can use the same stream with different variables; I believe this is
the preferred way.
Yeah, reset seems a more proper way here.
--
Best,
Xuneng
Attachments:
v3-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch (+96/-32)
v3-0005-Streamify-log_newpage_range-WAL-logging-path.patch (+24/-3)
v3-0002-Streamify-Bloom-VACUUM-paths.-n-nUse-streaming-re.patch (+51/-5)
v3-0001-Switch-Bloom-scan-paths-to-streaming-read.-n-nRep.patch (+25/-5)
v3-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch (+26/-3)
v3-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch (+78/-4)
Hi,
On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Ran pgindent using the updated typedefs.list.
--
Best,
Xuneng
Attachments:
v4-0001-Switch-Bloom-scan-paths-to-streaming-read.-n-nRep.patch (+25/-5)
v4-0005-Streamify-log_newpage_range-WAL-logging-path.patch (+24/-3)
v4-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch (+26/-3)
v4-0002-Streamify-Bloom-VACUUM-paths.-n-nUse-streaming-re.patch (+51/-5)
v4-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch (+78/-4)
v4-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch (+96/-32)
Hi,
On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I've completed benchmarking of the v4 streaming read patches across
three I/O methods (io_uring, sync, worker). Tests were run with cold
cache on large datasets.
--- Settings ---
shared_buffers = '8GB'
effective_io_concurrency = 200
io_method = $IO_METHOD
io_workers = $IO_WORKERS
io_max_concurrency = $IO_MAX_CONCURRENCY
track_io_timing = on
autovacuum = off
checkpoint_timeout = 1h
max_wal_size = 10GB
max_parallel_workers_per_gather = 0
--- Machine ---
CPU: 48-core
RAM: 256 GB DDR5
Disk: 2 x 1.92 TB NVMe SSD
--- Executive Summary ---
The patches provide significant benefits for I/O-bound sequential
operations, with the greatest improvements seen when using
asynchronous I/O methods (io_uring and worker). The synchronous I/O
mode shows reduced but still meaningful gains.
--- Results by I/O Method ---
Best Results: io_method=worker
bloom_scan: 4.14x (75.9% faster); 93% fewer reads
pgstattuple: 1.59x (37.1% faster); 94% fewer reads
hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
io_method=io_uring
bloom_scan: 3.12x (68.0% faster); 93% fewer reads
pgstattuple: 1.50x (33.2% faster); 94% fewer reads
hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
wal_logging: 1.00x (-0.5%, neutral); no change in reads
io_method=sync (baseline comparison)
bloom_scan: 1.20x (16.4% faster); 93% fewer reads
pgstattuple: 1.10x (9.0% faster); 94% fewer reads
hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
wal_logging: 0.99x (-0.7%, neutral); no change in reads
--- Observations ---
Async I/O amplifies streaming benefits: The same patches show 3-4x
improvement with worker/io_uring vs 1.2x with sync.
I/O operation reduction is consistent: All modes show the same ~93-94%
reduction in I/O operations for bloom_scan and pgstattuple.
VACUUM operations show modest gains: Despite large I/O reductions
(76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
larger CPU overhead (tuple processing, index maintenance, WAL
logging).
log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
--
Best,
Xuneng
Attachments:
run_streaming_benchmark.sh
Hi,
On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
There was an issue in the wal_log test of the original script.
--- The original benchmark ---
The original script used:
ALTER TABLE ... SET LOGGED
This path performs a full table rewrite via ATRewriteTable()
(tablecmds.c). It creates a new relfilenode and copies tuples into it.
It does not call log_newpage_range() on rewritten pages.
log_newpage_range() may only appear indirectly through the
pending-sync logic in storage.c, and only when:
- wal_level = minimal, and
- relation size < wal_skip_threshold (default 2MB).
Our test tables (1M–20M rows) are far larger than 2MB. In that case,
PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
previous benchmark measured table rewrite I/O, not the
log_newpage_range() path.
--- Current design: GIN index build ---
The benchmark now uses:
CREATE INDEX ... USING gin (doc_tsv)
This reliably exercises log_newpage_range() because:
- ginbuild() constructs the index and WAL-logs all new index pages
using log_newpage_range().
- This is part of the normal GIN build path, independent of wal_skip_threshold.
- The streaming-read patch modifies the WAL logging path inside
log_newpage_range(), which this test directly targets.
--- Results (wal_logging_large) ---
worker: 1.00x (+0.5%); no meaningful change in reads
io_uring: 1.01x (+1.3%); no meaningful change in reads
sync: 1.01x (+1.1%); no meaningful change in reads
--
Best,
Xuneng
Attachments:
run_streaming_benchmark.sh
Hi,
On Mon, Feb 9, 2026 at 6:40 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Thanks for looking into this.
On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Two more to go:
patch 5: Streamify log_newpage_range() WAL logging path
patch 6: Streamify hash index VACUUM primary bucket page readsBenchmarks will be conducted soon.
v6 in the last message has a problem and has not been updated. Attach
the right one again. Sorry for the noise.0003 and 0006:
You need to add 'StatApproxReadStreamPrivate' and
'HashBulkDeleteStreamPrivate' to the typedefs.list.Done.
0005:
@@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum, nbufs = 0; while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk) { - Buffer buf = ReadBufferExtended(rel, forknum, blkno, - RBM_NORMAL, NULL); + Buffer buf = read_stream_next_buffer(stream, NULL); + + if (!BufferIsValid(buf)) + break;We are loosening a check here, there should not be a invalid buffer in
the stream until the endblk. I think you can remove this
BufferIsValid() check, then we can learn if something goes wrong.My concern before for not adding assert at the end of streaming is the
potential early break in here:/* Nothing more to do if all remaining blocks were empty. */
if (nbufs == 0)
break;After looking more closely, it turns out to be a misunderstanding of the logic.
0006:
You can use read_stream_reset() instead of read_stream_end(), then you
can use the same stream with different variables, I believe this is
the preferred way.Rest LGTM!
Yeah, reset seems a more proper way here.
Run pgindent using the updated typedefs.list.
I've completed benchmarking of the v4 streaming read patches across
three I/O methods (io_uring, sync, worker). Tests were run with cold
cache on large datasets.--- Settings ---shared_buffers = '8GB'
effective_io_concurrency = 200
io_method = $IO_METHOD
io_workers = $IO_WORKERS
io_max_concurrency = $IO_MAX_CONCURRENCY
track_io_timing = on
autovacuum = off
checkpoint_timeout = 1h
max_wal_size = 10GB
max_parallel_workers_per_gather = 0--- Machine --- CPU: 48-core RAM: 256 GB DDR5 Disk: 2 x 1.92 TB NVMe SSD--- Executive Summary ---The patches provide significant benefits for I/O-bound sequential
operations, with the greatest improvements seen when using
asynchronous I/O methods (io_uring and worker). The synchronous I/O
mode shows reduced but still meaningful gains.--- Results by I/O MethodBest Results: io_method=worker
bloom_scan: 4.14x (75.9% faster); 93% fewer reads
pgstattuple: 1.59x (37.1% faster); 94% fewer reads
hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in readsio_method=io_uring
bloom_scan: 3.12x (68.0% faster); 93% fewer reads
pgstattuple: 1.50x (33.2% faster); 94% fewer reads
hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
wal_logging: 1.00x (-0.5%, neutral); no change in readsio_method=sync (baseline comparison)
bloom_scan: 1.20x (16.4% faster); 93% fewer reads
pgstattuple: 1.10x (9.0% faster); 94% fewer reads
hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
wal_logging: 0.99x (-0.7%, neutral); no change in reads--- Observations ---Async I/O amplifies streaming benefits: The same patches show 3-4x
improvement with worker/io_uring vs 1.2x with sync.I/O operation reduction is consistent: All modes show the same ~93-94%
reduction in I/O operations for bloom_scan and pgstattuple.VACUUM operations show modest gains: Despite large I/O reductions
(76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
larger CPU overhead (tuple processing, index maintenance, WAL
logging).log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
--
Best,
XunengThere was an issue in the wal_log test of the original script.
--- The original benchmark used: ALTER TABLE ... SET LOGGEDThis path performs a full table rewrite via ATRewriteTable()
(tablecmds.c). It creates a new relfilenode and copies tuples into it.
It does not call log_newpage_range() on rewritten pages.

log_newpage_range() may only appear indirectly through the
pending-sync logic in storage.c, and only when:
- wal_level = minimal, and
- relation size < wal_skip_threshold (default 2MB).

Our test tables (1M–20M rows) are far larger than 2MB. In that case,
PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
previous benchmark measured table rewrite I/O, not the
log_newpage_range() path.

--- Current design: GIN index build ---

The benchmark now uses:
CREATE INDEX ... USING gin (doc_tsv)

This reliably exercises log_newpage_range() because:
- ginbuild() constructs the index and WAL-logs all new index pages
using log_newpage_range().
- This is part of the normal GIN build path, independent of wal_skip_threshold.
- The streaming-read patch modifies the WAL logging path inside
log_newpage_range(), which this test directly targets.

--- Results (wal_logging_large) ---
worker:   1.00x (+0.5%); no meaningful change in reads
io_uring: 1.01x (+1.3%); no meaningful change in reads
sync:     1.01x (+1.1%); no meaningful change in reads

--
Best,
Xuneng
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.
--
Best,
Xuneng
Attachments:
v5-0001-Switch-Bloom-scan-paths-to-streaming-read.patchapplication/octet-stream; name=v5-0001-Switch-Bloom-scan-paths-to-streaming-read.patchDownload+25-5
v5-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patchapplication/octet-stream; name=v5-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patchDownload+96-32
v5-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patchapplication/octet-stream; name=v5-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patchDownload+26-3
v5-0002-Streamify-Bloom-VACUUM-paths.patchapplication/octet-stream; name=v5-0002-Streamify-Bloom-VACUUM-paths.patchDownload+51-5
v5-0005-Streamify-hash-index-VACUUM-primary-bucket-page-r.patchapplication/octet-stream; name=v5-0005-Streamify-hash-index-VACUUM-primary-bucket-page-r.patchDownload+78-4
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.
Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging. The worker method
seems more efficient, which may show that we are above the noise level.
The results associated with pgstattuple and the bloom scans are on a
different level for the three methods.
Saying that, it is really nice that you have sent the benchmark. The
measurement method looks in line with the goal here after review (IO
stats, calculations), and I have taken some time to run it to get an
idea of the difference for these five code paths (I slightly edited
the script for my own environment; the result is the same):
./run_streaming_benchmark --baseline --io-method=io_uring/worker
I am not much interested in the sync case, so I have tested the two
other methods:
1) method=IO-uring
bloom_scan_large base= 725.3ms patch= 99.9ms 7.26x
( 86.2%) (reads=19676->1294, io_time=688.36->33.69ms)
bloom_vacuum_large base= 7414.9ms patch= 7455.2ms 0.99x
( -0.5%) (reads=48361->11597, io_time=459.02->257.51ms)
pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)
gin_vacuum_large base= 3546.8ms patch= 2317.9ms 1.53x
( 34.6%) (reads=20734->17735, io_time=3244.40->2021.53ms)
hash_vacuum_large base= 12268.5ms patch= 11751.1ms 1.04x
( 4.2%) (reads=76677->15606, io_time=1483.10->315.03ms)
wal_logging_large base= 33713.0ms patch= 32773.9ms 1.03x
( 2.8%) (reads=21641->21641, io_time=81.18->77.25ms)
2) method=worker io-workers=3
bloom_scan_large base= 725.0ms patch= 465.7ms 1.56x
( 35.8%) (reads=19676->1294, io_time=688.70->52.20ms)
bloom_vacuum_large base= 7138.3ms patch= 7156.0ms 1.00x
( -0.2%) (reads=48361->11597, io_time=284.56->64.37ms)
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
gin_vacuum_large base= 3769.4ms patch= 3716.7ms 1.01x
( 1.4%) (reads=20775->17684, io_time=3562.21->3528.14ms)
hash_vacuum_large base= 11750.1ms patch= 11289.0ms 1.04x
( 3.9%) (reads=76677->15606, io_time=1296.03->98.72ms)
wal_logging_large base= 32862.3ms patch= 33179.7ms 0.99x
( -1.0%) (reads=21641->21641, io_time=91.42->90.59ms)
The bloom scan case is a winner in runtime for both cases, and in
terms of stats we get much better numbers for all of them. These feel
rather in line with what you have, except for pgstattuple's runtime,
still its IO numbers feel good. That's just to say that I'll review
them and try to do something about at least some of the pieces for
this release.
--
Michael
Hi Michael,
On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.

Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging. The worker method
seems more efficient, may show that we are out of noise level.
The results associated to pgstattuple and the bloom scans are on a
different level for the three methods.

Saying that, it is really nice that you have sent the benchmark. The
measurement method looks in line with the goal here after review (IO
stats, calculations), and I have taken some time to run it to get an
idea of the difference for these five code paths, as of (slightly
edited the script for my own environment, result is the same):
./run_streaming_benchmark --baseline --io-method=io_uring/worker

I am not much interested in the sync case, so I have tested the two
other methods:

1) method=IO-uring
bloom_scan_large base= 725.3ms patch= 99.9ms 7.26x
( 86.2%) (reads=19676->1294, io_time=688.36->33.69ms)
bloom_vacuum_large base= 7414.9ms patch= 7455.2ms 0.99x
( -0.5%) (reads=48361->11597, io_time=459.02->257.51ms)
pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)
gin_vacuum_large base= 3546.8ms patch= 2317.9ms 1.53x
( 34.6%) (reads=20734->17735, io_time=3244.40->2021.53ms)
hash_vacuum_large base= 12268.5ms patch= 11751.1ms 1.04x
( 4.2%) (reads=76677->15606, io_time=1483.10->315.03ms)
wal_logging_large base= 33713.0ms patch= 32773.9ms 1.03x
( 2.8%) (reads=21641->21641, io_time=81.18->77.25ms)

2) method=worker io-workers=3
bloom_scan_large base= 725.0ms patch= 465.7ms 1.56x
( 35.8%) (reads=19676->1294, io_time=688.70->52.20ms)
bloom_vacuum_large base= 7138.3ms patch= 7156.0ms 1.00x
( -0.2%) (reads=48361->11597, io_time=284.56->64.37ms)
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
gin_vacuum_large base= 3769.4ms patch= 3716.7ms 1.01x
( 1.4%) (reads=20775->17684, io_time=3562.21->3528.14ms)
hash_vacuum_large base= 11750.1ms patch= 11289.0ms 1.04x
( 3.9%) (reads=76677->15606, io_time=1296.03->98.72ms)
wal_logging_large base= 32862.3ms patch= 33179.7ms 0.99x
( -1.0%) (reads=21641->21641, io_time=91.42->90.59ms)

The bloom scan case is a winner in runtime for both cases, and in
terms of stats we get much better numbers for all of them. These feel
rather in line with what you have, except for pgstattuple's runtime,
still its IO numbers feel good.
Thanks for running the benchmarks! The performance gains for hash,
gin, bloom_vacuum, and wal_logging are insignificant, likely because
these workloads are not I/O-bound. The default number of I/O workers
is three, which is fairly conservative. When I ran the benchmark
script with a higher number of I/O workers, some runs showed improved
performance.
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)
Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.
method=io_uring
pgstattuple_large base= 5551.5ms patch= 3498.2ms 1.59x
( 37.0%) (reads=206945→12983, io_time=2323.49→207.14ms)
I ran the benchmark for this test again with io_uring, and the result
is consistent with previous runs. I’m not sure what might be
contributing to this behavior.
Another code path that showed significant performance improvement is
pgstatindex [1]. I've incorporated the test into the script too. Here
are the results from my testing:
method=worker io-workers=12
pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x
( 76.8%) (reads=27460→1757, io_time=213.94→6.31ms)
method=io_uring
pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x
( 74.9%) (reads=27460→1757, io_time=204.41→4.88ms)
That's just to say that I'll review
them and try to do something about at least some of the pieces for
this release.
Thanks for that.
[1]: /messages/by-id/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ@mail.gmail.com
--
Best,
Xuneng
Hi,
On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.

Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging.
It's perhaps worth emphasizing that, to allow real world usage of direct IO,
we'll need streaming implementation for most of these. Also, on windows the OS
provided readahead is ... not aggressive, so you'll hit IO stalls much more
frequently than you'd on linux (and some of the BSDs).
It might be a good idea to run the benchmarks with debug_io_direct=data.
That'll make them very slow, since the write side doesn't yet use AIO and thus
will do a lot of synchronous writes, but it should still allow to evaluate the
gains from using read stream.
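For anyone wanting to reproduce such a run, a minimal postgresql.conf sketch might look like the following (settings assumed from this paragraph and the thread, not from the benchmark script; debug_io_direct and io_method require a server restart, and debug_io_direct is a developer option with alignment/platform caveats):

```
# developer option: bypass the OS page cache for data files
debug_io_direct = 'data'

# AIO implementation to compare
io_method = 'io_uring'      # or 'worker'
io_workers = 3              # only used with io_method = 'worker'
```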
The other thing that's kinda important to evaluate read streams is to test on
higher latency storage, even without direct IO. Many workloads are not at all
benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
To be able to test such higher latencies locally, I've found it quite useful
to use dm_delay above a fast disk. See [1].
The worker method seems more efficient, may show that we are out of noise
level.
I think that's more likely to show that memory bandwidth, probably due to
checksum computations, is a factor. The memory copy (from the kernel page
cache, with buffered IO) and the checksum computations (when checksums are
enabled) are parallelized by worker, but not by io_uring.
Greetings,
Andres Freund
[1]: https://docs.kernel.org/admin-guide/device-mapper/delay.html
Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
introduced for it:
umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/
To update the amount of delay to 3ms the following can be used:
dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
(I will often just update the delay to 0 for comparison runs, as that
doesn't require remounting)
Hi,
On 2026-03-10 21:23:26 +0800, Xuneng Zhou wrote:
On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
Thanks for running the benchmarks! The performance gains for hash,
gin, bloom_vacuum, and wal_logging are insignificant, likely because
these workloads are not I/O-bound. The default number of I/O workers
is three, which is fairly conservative. When I ran the benchmark
script with a higher number of I/O workers, some runs showed improved
performance.
FWIW, another thing that may be an issue is that you're restarting postgres
all the time, as part of drop_caches(). That means we'll spend time reloading
catalog metadata and initializing shared buffers (the first write to a shared
buffers page is considerably more expensive than later ones, as the backing
memory needs to be initialized first).
I found it useful to use the pg_buffercache extension (specifically
pg_buffercache_evict_relation()) to just drop the relation that is going to be
tested from shared_buffers.
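A sketch of what that could look like in the benchmark script, assuming the heap_test table from the script and the PostgreSQL 18 pg_buffercache function (the OS page cache would still need dropping separately for fully cold-cache runs):

```sql
-- Evict just the relation under test, without restarting the server.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT buffers_evicted, buffers_flushed, buffers_skipped
FROM pg_buffercache_evict_relation('heap_test'::regclass);
```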
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)

pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)

Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.
It's possible that the bottleneck just moved, e.g to the checksum computation,
if you have data checksums enabled.
It's also worth noting that likely each of the test reps measures
something different, as likely
psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
leads to some out-of-page updates.
You're probably better off deleting some of the data in a transaction that is
then rolled back. That will also unset all-visible, but won't otherwise change
the layout, no matter how many test iterations you run.
I'd also guess that you're seeing a relatively small win because you're
updating every page. When reading every page from disk, the OS can do
efficient readahead. If there are only occasional misses, that does not work.
method=io_uring
pgstattuple_large base= 5551.5ms patch= 3498.2ms 1.59x
( 37.0%) (reads=206945→12983, io_time=2323.49→207.14ms)

I ran the benchmark for this test again with io_uring, and the result
is consistent with previous runs. I’m not sure what might be
contributing to this behavior.
What does a perf profile show? Is the query CPU bound?
Another code path that showed significant performance improvement is
pgstatindex [1]. I've incorporated the test into the script too. Here
are the results from my testing:

method=worker io-workers=12
pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x
( 76.8%) (reads=27460→1757, io_time=213.94→6.31ms)

method=io_uring
pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x
( 74.9%) (reads=27460→1757, io_time=204.41→4.88ms)
Nice!
Greetings,
Andres Freund
On Tue, Mar 10, 2026 at 07:04:37PM -0400, Andres Freund wrote:
It might be a good idea to run the benchmarks with debug_io_direct=data.
That'll make them very slow, since the write side doesn't yet use AIO and thus
will do a lot of synchronous writes, but it should still allow to evaluate the
gains from using read stream.
Ah, thanks for the tip. I'll go try that.
The other thing that's kinda important to evaluate read streams is to test on
higher latency storage, even without direct IO. Many workloads are not at all
benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
My previous run was on a cloud instance; I don't have access to an SSD
with this amount of latency locally.
One thing that was standing out is the bloom bitmap case, which was
looking really nice for a large number of rows, so I have applied
this part. The rest is going to need a bit more testing to build more
confidence, as far as I can see.
--
Michael
Hi,
On 2026-03-10 19:27:59 -0400, Andres Freund wrote:
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)

pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)

Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.

It's possible that the bottleneck just moved, e.g. to the checksum computation,
if you have data checksums enabled.

It's also worth noting that likely each of the test reps measures
something different, as likely
psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
leads to some out-of-page updates.
You're probably better off deleting some of the data in a transaction that is
then rolled back. That will also unset all-visible, but won't otherwise change
the layout, no matter how many test iterations you run.

I'd also guess that you're seeing a relatively small win because you're
updating every page. When reading every page from disk, the OS can do
efficient readahead. If there are only occasional misses, that does not work.
I think that last one is a big part - if I use
BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
(which leaves a lot of
I see much bigger wins due to the pgstattuple changes.
                   time buffered    time DIO
w/o read stream    2222.078 ms      2090.239 ms
w/  read stream     299.455 ms       155.124 ms
That's with local storage. io_uring, but numbers with worker are similar.
Greetings,
Andres Freund
Hi Andres,
On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.

Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging.

It's perhaps worth emphasizing that, to allow real world usage of direct IO,
we'll need streaming implementation for most of these. Also, on windows the OS
provided readahead is ... not aggressive, so you'll hit IO stalls much more
frequently than you'd on linux (and some of the BSDs).

It might be a good idea to run the benchmarks with debug_io_direct=data.
That'll make them very slow, since the write side doesn't yet use AIO and thus
will do a lot of synchronous writes, but it should still allow to evaluate the
gains from using read stream.

The other thing that's kinda important to evaluate read streams is to test on
higher latency storage, even without direct IO. Many workloads are not at all
benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.

To be able to test such higher latencies locally, I've found it quite useful
to use dm_delay above a fast disk. See [1].
Thanks for the tips! I currently don’t have access to a machine or
cloud instance with slower SSDs or HDDs that have higher latency. I’ll
try running the benchmark with debug_io_direct=data and dm_delay, as
you suggested, to see if the results vary.
The worker method seems more efficient, may show that we are out of noise
level.

I think that's more likely to show that memory bandwidth, probably due to
checksum computations, is a factor. The memory copy (from the kernel page
cache, with buffered IO) and the checksum computations (when checksums are
enabled) are parallelized by worker, but not by io_uring.

Greetings,
Andres Freund
[1]
https://docs.kernel.org/admin-guide/device-mapper/delay.html
Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
introduced for it:

umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/
To update the amount of delay to 3ms the following can be used:
dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed

(I will often just update the delay to 0 for comparison runs, as that
doesn't require remounting)
--
Best,
Xuneng
Hi,
On Wed, Mar 11, 2026 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-03-10 21:23:26 +0800, Xuneng Zhou wrote:
On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
Thanks for running the benchmarks! The performance gains for hash,
gin, bloom_vacuum, and wal_logging are insignificant, likely because
these workloads are not I/O-bound. The default number of I/O workers
is three, which is fairly conservative. When I ran the benchmark
script with a higher number of I/O workers, some runs showed improved
performance.

FWIW, another thing that may be an issue is that you're restarting postgres
all the time, as part of drop_caches(). That means we'll spend time reloading
catalog metadata and initializing shared buffers (the first write to a shared
buffers page is considerably more expensive than later ones, as the backing
memory needs to be initialized first).

I found it useful to use the pg_buffercache extension (specifically
pg_buffercache_evict_relation()) to just drop the relation that is going to be
tested from shared_buffers.
Good point. I'll switch to using pg_buffercache_evict_relation() to
evict only the target relation, keeping the cluster running. That
should reduce measurement noise to some extent.
pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)

pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)

Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.

It's possible that the bottleneck just moved, e.g. to the checksum computation,
if you have data checksums enabled.

It's also worth noting that likely each of the test reps measures
something different, as likely
psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
leads to some out-of-page updates.
You're probably better off deleting some of the data in a transaction that is
then rolled back. That will also unset all-visible, but won't otherwise change
the layout, no matter how many test iterations you run.

I'd also guess that you're seeing a relatively small win because you're
updating every page. When reading every page from disk, the OS can do
efficient readahead. If there are only occasional misses, that does not work.
Yeah, the repeated UPDATE changes the table layout across reps. I'll switch to:
BEGIN;
DELETE FROM heap_test WHERE id % N = 0;
ROLLBACK;
This clears the visibility map bits without altering the physical
layout, so every rep measures the same table state.
method=io_uring
pgstattuple_large base= 5551.5ms patch= 3498.2ms 1.59x
( 37.0%) (reads=206945→12983, io_time=2323.49→207.14ms)

I ran the benchmark for this test again with io_uring, and the result
is consistent with previous runs. I’m not sure what might be
contributing to this behavior.

What does a perf profile show? Is the query CPU bound?
The runtime in my run of pgstattuple was reduced significantly due to
the reduction in I/O time. I don’t think running perf on my setup
would reveal anything particularly meaningful. The script has an
option to run with perf, so perhaps Michael could try it to see
whether the query becomes CPU-bound, if he’s interested and has time.
Another code path that showed significant performance improvement is
pgstatindex [1]. I've incorporated the test into the script too. Here
are the results from my testing:

method=worker io-workers=12
pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x
( 76.8%) (reads=27460→1757, io_time=213.94→6.31ms)

method=io_uring
pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x
( 74.9%) (reads=27460→1757, io_time=204.41→4.88ms)

Nice!
Greetings,
Andres Freund
--
Best,
Xuneng