pgstattuple: Use streaming read API in pgstatindex functions
Hi hackers,
While reading the code related to streaming reads and their current
use cases, I noticed that pgstatindex could potentially benefit from
adopting the streaming read API. The required change is relatively
simple, similar to what has already been implemented in the pg_prewarm and
pg_visibility modules. I also ran some performance tests on an
experimental patch to validate the improvement.
Summary
Cold cache performance (the typical use case for diagnostic tools):
- Medium indexes (~21 MB): 1.21x-1.79x faster (20-44% speedup)
- Large indexes (~214 MB): 1.50x-1.90x faster (30-47% speedup)
- XLarge indexes (~1351 MB): 1.4x-1.9x faster (29-47% speedup)
Hardware: Hetzner AX162-R
Test matrix:
- Index types: Primary key, timestamp, float, composite (3 columns)
- Sizes: Medium (1M rows, ~21 MB), Large (10M rows, ~214 MB), XLarge
(50M rows, ~1351 MB)
- Layouts: Unfragmented (sequential) and Fragmented (random insert order)
- Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)
Xlarge fragmented example:
==> Creating secondary indexes on test_xlarge
Created 3 secondary indexes: created_at, score, composite
Created test_xlarge_pkey: 1351 MB
Fragmentation stats (random insert order):
leaf_frag_pct | avg_density_pct | leaf_pages | size
---------------+-----------------+------------+---------
49.9 | 71.5 | 172272 | 1351 MB
(1 row)
Configuration:
- shared_buffers = 16GB
- effective_io_concurrency = 500
- io_combine_limit = 16
- autovacuum = off
- checkpoint_timeout = 1h
- bgwriter_delay = 10000ms (minimize background writes)
- jit = off
- max_parallel_workers_per_gather = 0
Unfragmented Indexes (Cold Cache)
Index Type Size Baseline Patched Speedup
Primary Key Medium 31.5 ms 19.6 ms 1.58×
Primary Key Large 184.0 ms 119.0 ms 1.54×
Timestamp Medium 13.4 ms 10.5 ms 1.28×
Timestamp Large 85.0 ms 45.6 ms 1.86×
Float (score) Medium 13.7 ms 11.4 ms 1.21×
Float (score) Large 84.0 ms 45.0 ms 1.86×
Composite (3 col) Medium 26.7 ms 17.2 ms 1.56×
Composite (3 col) Large 89.8 ms 51.2 ms 1.75×
⸻
Fragmented Indexes (Cold Cache)
To address concerns about filesystem fragmentation, I tested indexes built
with random inserts (ORDER BY random()) to trigger page splits and create
fragmented indexes:
Index Type Size Baseline Patched Speedup
Primary Key Medium 41.9 ms 23.5 ms 1.79×
Primary Key Large 236.0 ms 148.0 ms 1.58×
Primary Key XLarge 953.4 ms 663.1 ms 1.43×
Timestamp Medium 32.1 ms 18.8 ms 1.70×
Timestamp Large 188.0 ms 117.0 ms 1.59×
Timestamp XLarge 493.0 ms 518.6 ms 0.95×
Float (score) Medium 14.0 ms 10.9 ms 1.28×
Float (score) Large 85.8 ms 45.2 ms 1.89×
Float (score) XLarge 263.2 ms 176.5 ms 1.49×
Composite (3 col) Medium 42.3 ms 24.1 ms 1.75×
Composite (3 col) Large 245.0 ms 162.0 ms 1.51×
Composite (3 col) XLarge 1052.5 ms 716.5 ms 1.46×
Summary: Fragmentation generally does not hurt streaming reads; most
fragmented cases still see 1.4×–1.9× gains. One outlier (XLarge
Timestamp) shows a slight regression (0.95×).
⸻
Warm Cache Results
When indexes are fully cached in shared_buffers:
Unfragmented: occasional small regressions for small- to medium-sized
indexes (single-digit ms variance, barely noticeable); small gains for
large indexes.
Fragmented: the same pattern as unfragmented, with occasional small
regressions for small- to medium-sized indexes and small gains for
large indexes.
Best,
Xuneng
Attachments:
0001-pgstattuple-Use-streaming-read-API-in-pgstatindex-fu.patch
From 153ab2799803dc402789d0aa825456ea12f2d3d9 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 12 Oct 2025 21:27:22 +0800
Subject: [PATCH] pgstattuple: Use streaming read API in pgstatindex functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 205 ++++++++++++++++++------------
1 file changed, 121 insertions(+), 84 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..a708cc417b0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -273,60 +274,77 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
- for (blkno = 1; blkno < nblocks; blkno++)
{
- Buffer buffer;
- Page page;
- BTPageOpaque opaque;
-
- CHECK_FOR_INTERRUPTS();
-
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
-
- page = BufferGetPage(buffer);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Determine page type, and update totals.
- *
- * Note that we arbitrarily bucket deleted pages together without
- * considering if they're leaf pages or internal pages.
- */
- if (P_ISDELETED(opaque))
- indexStat.deleted_pages++;
- else if (P_IGNORE(opaque))
- indexStat.empty_pages++; /* this is the "half dead" state */
- else if (P_ISLEAF(opaque))
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- int max_avail;
+ Buffer buffer;
+ Page page;
+ BTPageOpaque opaque;
- max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
- indexStat.max_avail += max_avail;
- indexStat.free_space += PageGetExactFreeSpace(page);
+ CHECK_FOR_INTERRUPTS();
- indexStat.leaf_pages++;
+ buffer = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ opaque = BTPageGetOpaque(page);
/*
- * If the next leaf is on an earlier block, it means a
- * fragmentation.
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
*/
- if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
- indexStat.fragments++;
- }
- else
- indexStat.internal_pages++;
+ if (P_ISDELETED(opaque))
+ indexStat.deleted_pages++;
+ else if (P_IGNORE(opaque))
+ indexStat.empty_pages++; /* this is the "half dead" state */
+ else if (P_ISLEAF(opaque))
+ {
+ int max_avail;
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
+ indexStat.max_avail += max_avail;
+ indexStat.free_space += PageGetExactFreeSpace(page);
+
+ indexStat.leaf_pages++;
+
+ /*
+ * If the next leaf is on an earlier block, it means a
+ * fragmentation.
+ */
+ if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+ indexStat.fragments++;
+ }
+ else
+ indexStat.internal_pages++;
+
+ UnlockReleaseBuffer(buffer);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
relation_close(rel, AccessShareLock);
/*----------------------------
@@ -636,60 +654,79 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
- for (blkno = 1; blkno < nblocks; blkno++)
+ /* Scan all blocks except the metapage using streaming reads */
{
- Buffer buf;
- Page page;
-
- CHECK_FOR_INTERRUPTS();
-
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
- stats.unused_pages++;
- else if (PageGetSpecialSize(page) !=
- MAXALIGN(sizeof(HashPageOpaqueData)))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("index \"%s\" contains corrupted page at block %u",
- RelationGetRelationName(rel),
- BufferGetBlockNumber(buf))));
- else
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- HashPageOpaque opaque;
- int pagetype;
+ Buffer buf;
+ Page page;
- opaque = HashPageGetOpaque(page);
- pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+ CHECK_FOR_INTERRUPTS();
- if (pagetype == LH_BUCKET_PAGE)
- {
- stats.bucket_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_OVERFLOW_PAGE)
- {
- stats.overflow_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_BITMAP_PAGE)
- stats.bitmap_pages++;
- else if (pagetype == LH_UNUSED_PAGE)
+ buf = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
stats.unused_pages++;
- else
+ else if (PageGetSpecialSize(page) !=
+ MAXALIGN(sizeof(HashPageOpaqueData)))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
- opaque->hasho_flag, RelationGetRelationName(rel),
+ errmsg("index \"%s\" contains corrupted page at block %u",
+ RelationGetRelationName(rel),
BufferGetBlockNumber(buf))));
- }
+ else
+ {
+ HashPageOpaque opaque;
+ int pagetype;
+
+ opaque = HashPageGetOpaque(page);
+ pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+
+ if (pagetype == LH_BUCKET_PAGE)
+ {
+ stats.bucket_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_OVERFLOW_PAGE)
+ {
+ stats.overflow_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_BITMAP_PAGE)
+ stats.bitmap_pages++;
+ else if (pagetype == LH_UNUSED_PAGE)
+ stats.unused_pages++;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
+ opaque->hasho_flag, RelationGetRelationName(rel),
+ BufferGetBlockNumber(buf))));
+ }
UnlockReleaseBuffer(buf);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
/* Done accessing the index */
index_close(rel, AccessShareLock);
--
2.51.0
Hi,
On Mon, Oct 13, 2025 at 10:07 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Fix indentation issue in v1.
Best,
Xuneng
Attachments:
v2-0001-pgstattuple-Use-streaming-read-API-in-pgstatindex.patch
From 6cd94d60d8982c60d230ccca9fc6ae073f25a8b9 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 13 Oct 2025 11:00:50 +0800
Subject: [PATCH v2] pgstattuple: Use streaming read API in pgstatindex
functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 203 ++++++++++++++++++------------
1 file changed, 120 insertions(+), 83 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..4286706f029 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -273,58 +274,75 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
- for (blkno = 1; blkno < nblocks; blkno++)
{
- Buffer buffer;
- Page page;
- BTPageOpaque opaque;
-
- CHECK_FOR_INTERRUPTS();
-
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
-
- page = BufferGetPage(buffer);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Determine page type, and update totals.
- *
- * Note that we arbitrarily bucket deleted pages together without
- * considering if they're leaf pages or internal pages.
- */
- if (P_ISDELETED(opaque))
- indexStat.deleted_pages++;
- else if (P_IGNORE(opaque))
- indexStat.empty_pages++; /* this is the "half dead" state */
- else if (P_ISLEAF(opaque))
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- int max_avail;
+ Buffer buffer;
+ Page page;
+ BTPageOpaque opaque;
- max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
- indexStat.max_avail += max_avail;
- indexStat.free_space += PageGetExactFreeSpace(page);
+ CHECK_FOR_INTERRUPTS();
- indexStat.leaf_pages++;
+ buffer = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ opaque = BTPageGetOpaque(page);
/*
- * If the next leaf is on an earlier block, it means a
- * fragmentation.
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
*/
- if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
- indexStat.fragments++;
+ if (P_ISDELETED(opaque))
+ indexStat.deleted_pages++;
+ else if (P_IGNORE(opaque))
+ indexStat.empty_pages++; /* this is the "half dead" state */
+ else if (P_ISLEAF(opaque))
+ {
+ int max_avail;
+
+ max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
+ indexStat.max_avail += max_avail;
+ indexStat.free_space += PageGetExactFreeSpace(page);
+
+ indexStat.leaf_pages++;
+
+ /*
+ * If the next leaf is on an earlier block, it means a
+ * fragmentation.
+ */
+ if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+ indexStat.fragments++;
+ }
+ else
+ indexStat.internal_pages++;
+
+ UnlockReleaseBuffer(buffer);
}
- else
- indexStat.internal_pages++;
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
}
relation_close(rel, AccessShareLock);
@@ -636,58 +654,77 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
- for (blkno = 1; blkno < nblocks; blkno++)
+ /* Scan all blocks except the metapage using streaming reads */
{
- Buffer buf;
- Page page;
-
- CHECK_FOR_INTERRUPTS();
-
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
- stats.unused_pages++;
- else if (PageGetSpecialSize(page) !=
- MAXALIGN(sizeof(HashPageOpaqueData)))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("index \"%s\" contains corrupted page at block %u",
- RelationGetRelationName(rel),
- BufferGetBlockNumber(buf))));
- else
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- HashPageOpaque opaque;
- int pagetype;
+ Buffer buf;
+ Page page;
- opaque = HashPageGetOpaque(page);
- pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+ CHECK_FOR_INTERRUPTS();
- if (pagetype == LH_BUCKET_PAGE)
- {
- stats.bucket_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_OVERFLOW_PAGE)
- {
- stats.overflow_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_BITMAP_PAGE)
- stats.bitmap_pages++;
- else if (pagetype == LH_UNUSED_PAGE)
+ buf = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
stats.unused_pages++;
- else
+ else if (PageGetSpecialSize(page) !=
+ MAXALIGN(sizeof(HashPageOpaqueData)))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
- opaque->hasho_flag, RelationGetRelationName(rel),
+ errmsg("index \"%s\" contains corrupted page at block %u",
+ RelationGetRelationName(rel),
BufferGetBlockNumber(buf))));
+ else
+ {
+ HashPageOpaque opaque;
+ int pagetype;
+
+ opaque = HashPageGetOpaque(page);
+ pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+
+ if (pagetype == LH_BUCKET_PAGE)
+ {
+ stats.bucket_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_OVERFLOW_PAGE)
+ {
+ stats.overflow_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_BITMAP_PAGE)
+ stats.bitmap_pages++;
+ else if (pagetype == LH_UNUSED_PAGE)
+ stats.unused_pages++;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
+ opaque->hasho_flag, RelationGetRelationName(rel),
+ BufferGetBlockNumber(buf))));
+ }
+ UnlockReleaseBuffer(buf);
}
- UnlockReleaseBuffer(buf);
+
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
}
/* Done accessing the index */
--
2.51.0
Hi,
Thank you for working on this!
On Mon, 13 Oct 2025 at 06:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
Fix indentation issue in v1.
I did not look at the benchmarks, so here are my code comments.
- I would avoid creating a new scope for the streaming read. While it
makes the streaming code easier to interpret, it introduces a large
diff due to indentation changes.
- I suggest adding the following comment about streaming batching, as
it is used in other places:
/*
* It is safe to use batchmode as block_range_read_stream_cb takes no
* locks.
*/
- For the '/* Scan all blocks except the metapage using streaming
reads */' comments, it might be helpful to clarify that the 0th page
is the metapage. Something like: '/* Scan all blocks except the
metapage (0th page) using streaming reads */'.
Other than these comments, the code looks good to me.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi Bilal,
Thanks for looking into this.
On Mon, Oct 13, 2025 at 3:00 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.
Best,
Xuneng
Attachments:
v3-0001-pgstattuple-Use-streaming-read-API-in-pgstatindex.patch
From 318941dbcdc45a06b149e1cfc5af38540d46eca5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 13 Oct 2025 16:22:10 +0800
Subject: [PATCH v3] pgstattuple: Use streaming read API in pgstatindex
functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 57 ++++++++++++++++++++++++++-----
1 file changed, 48 insertions(+), 9 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..b6b11ab043f 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -217,6 +218,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
BlockNumber blkno;
BTIndexStat indexStat;
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
if (!IS_INDEX(rel) || !IS_BTREE(rel))
ereport(ERROR,
@@ -273,10 +276,26 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage (0th page) using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ /*
+ * It is safe to use batchmode as block_range_read_stream_cb takes no
+ * locks.
+ */
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
for (blkno = 1; blkno < nblocks; blkno++)
{
Buffer buffer;
@@ -285,8 +304,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
CHECK_FOR_INTERRUPTS();
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ buffer = read_stream_next_buffer(stream, NULL);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
page = BufferGetPage(buffer);
@@ -322,11 +340,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
else
indexStat.internal_pages++;
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ UnlockReleaseBuffer(buffer);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+
relation_close(rel, AccessShareLock);
/*----------------------------
@@ -596,6 +615,8 @@ pgstathashindex(PG_FUNCTION_ARGS)
HashMetaPage metap;
float8 free_percent;
uint64 total_space;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
rel = relation_open(relid, AccessShareLock);
@@ -636,7 +657,23 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
+ /* Scan all blocks except the metapage (0th page) using streaming reads */
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ /*
+ * It is safe to use batchmode as block_range_read_stream_cb takes no
+ * locks.
+ */
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
for (blkno = 1; blkno < nblocks; blkno++)
{
Buffer buf;
@@ -644,8 +681,7 @@ pgstathashindex(PG_FUNCTION_ARGS)
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
+ buf = read_stream_next_buffer(stream, NULL);
LockBuffer(buf, BUFFER_LOCK_SHARE);
page = BufferGetPage(buf);
@@ -690,6 +726,9 @@ pgstathashindex(PG_FUNCTION_ARGS)
UnlockReleaseBuffer(buf);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+
/* Done accessing the index */
index_close(rel, AccessShareLock);
--
2.51.0
Hi Xuneng Zhou
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ UnlockReleaseBuffer(buffer);
  }
Thanks for your patch! Just to nitpick a bit — I think this comment is
worth keeping, even though the function name already conveys its meaning.
Hi,
On Mon, 13 Oct 2025 at 11:42, Xuneng Zhou <xunengzhou@gmail.com> wrote:
Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.
Thanks, the code looks good to me!
The benchmark results are nice. I have a quick question about the
setup, if you are benchmarking with the default io_method (worker),
then the results might show the combined benefit of async I/O and
streaming I/O, rather than just streaming I/O alone. If that is the
case, I would suggest also running a benchmark with io_method=sync to
isolate the performance impact of streaming I/O. You might get
interesting results.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi Wenhui,
Thanks for looking into this.
On Mon, Oct 13, 2025 at 5:41 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
It makes sense. I'll keep it.
Best,
Xuneng
Hi,
On Mon, Oct 13, 2025 at 5:42 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
The previous benchmark was conducted using the default
io_method=worker. The following results are from a new benchmark run
with io_method=sync. The performance gains from streaming reads are
still present, though smaller than those observed when asynchronous
I/O and streaming I/O are combined — which is expected. Thanks for the
suggestion!
Fragmented Indexes (Cold Cache)
Index Type Size Baseline Patched Speedup
Primary Key Medium 43.80 ms 38.61 ms 1.13×
Primary Key Large 238.24 ms 202.47 ms 1.17×
Primary Key XLarge 962.90 ms 793.57 ms 1.21×
Timestamp Medium 33.34 ms 29.98 ms 1.11×
Timestamp Large 190.41 ms 161.34 ms 1.18×
Timestamp XLarge 794.52 ms 647.82 ms 1.22×
Float (score) Medium 14.46 ms 13.51 ms 1.06×
Float (score) Large 87.38 ms 77.22 ms 1.13×
Float (score) XLarge 278.47 ms 233.22 ms 1.19×
Composite (3 col) Medium 44.49 ms 40.10 ms 1.10×
Composite (3 col) Large 244.86 ms 211.01 ms 1.16×
Composite (3 col) XLarge 1073.32 ms 872.42 ms 1.23×
________________________________
Fragmented Indexes (Warm Cache)
Index Type Size Baseline Patched Speedup
Primary Key Medium 7.91 ms 7.88 ms 1.00×
Primary Key Large 35.58 ms 36.41 ms 0.97×
Primary Key XLarge 126.29 ms 126.95 ms 0.99×
Timestamp Medium 5.14 ms 6.82 ms 0.75×
Timestamp Large 22.96 ms 29.55 ms 0.77×
Timestamp XLarge 104.26 ms 106.01 ms 0.98×
Float (score) Medium 3.76 ms 4.18 ms 0.90×
Float (score) Large 13.65 ms 13.01 ms 1.04×
Float (score) XLarge 40.58 ms 41.28 ms 0.98×
Composite (3 col) Medium 8.23 ms 8.25 ms 0.99×
Composite (3 col) Large 37.10 ms 37.59 ms 0.98×
Composite (3 col) XLarge 139.89 ms 138.21 ms 1.01×
Best,
Xuneng
Attachments:
v4-0001-pgstattuple-Use-streaming-read-API-in-pgstatindex.patch (application/octet-stream)
From a34e0aacf5b85f161ff1d53760bd5c64ae08cd2e Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 13 Oct 2025 19:32:03 +0800
Subject: [PATCH v4] pgstattuple: Use streaming read API in pgstatindex
functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 57 ++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..30c0dd1b111 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -217,6 +218,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
BlockNumber blkno;
BTIndexStat indexStat;
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
if (!IS_INDEX(rel) || !IS_BTREE(rel))
ereport(ERROR,
@@ -273,10 +276,26 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage (0th page) using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ /*
+ * It is safe to use batchmode as block_range_read_stream_cb takes no
+ * locks.
+ */
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
for (blkno = 1; blkno < nblocks; blkno++)
{
Buffer buffer;
@@ -285,8 +304,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
CHECK_FOR_INTERRUPTS();
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ buffer = read_stream_next_buffer(stream, NULL);
LockBuffer(buffer, BUFFER_LOCK_SHARE);
page = BufferGetPage(buffer);
@@ -323,10 +341,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.internal_pages++;
/* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ UnlockReleaseBuffer(buffer);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+
relation_close(rel, AccessShareLock);
/*----------------------------
@@ -596,6 +616,8 @@ pgstathashindex(PG_FUNCTION_ARGS)
HashMetaPage metap;
float8 free_percent;
uint64 total_space;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
rel = relation_open(relid, AccessShareLock);
@@ -636,7 +658,23 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
+ /* Scan all blocks except the metapage (0th page) using streaming reads */
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ /*
+ * It is safe to use batchmode as block_range_read_stream_cb takes no
+ * locks.
+ */
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
for (blkno = 1; blkno < nblocks; blkno++)
{
Buffer buf;
@@ -644,8 +682,7 @@ pgstathashindex(PG_FUNCTION_ARGS)
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
+ buf = read_stream_next_buffer(stream, NULL);
LockBuffer(buf, BUFFER_LOCK_SHARE);
page = BufferGetPage(buf);
@@ -687,9 +724,13 @@ pgstathashindex(PG_FUNCTION_ARGS)
opaque->hasho_flag, RelationGetRelationName(rel),
BufferGetBlockNumber(buf))));
}
+ /* Unlock and release buffer */
UnlockReleaseBuffer(buf);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+
/* Done accessing the index */
index_close(rel, AccessShareLock);
--
2.51.0
Hi,
Here’s the updated summary report (cold cache, fragmented index), now
including results for the streaming I/O + io_uring configuration.
[image: image.png]
Best,
Xuneng
Attachments:
image.png (image/png)