EXPLAIN: showing ReadStream / prefetch stats

Started by Tomas Vondra22 days ago37 messageshackers

tomas.vondra@2ndquadrant.com

22 days ago

Hi,

Here's a patch adding prefetch/read-ahead info about IO to EXPLAIN. This
was initially developed as part of the index prefetch patches, but it
became clear it can be useful for other nodes using ReadStream. That
includes seqscan and bitmap heap scans, and (hopefully) will include
index scan in PG19.

The 0001 patch is "prerequisite" adjusting WaitReadBuffers to indicate
if the buffer read had to wait for the I/O. 0002 is the part this thread
is really about.

In EXPLAIN, this new info is enabled by a new option "IO", which adds
two new lines to the output:

--------------------
EXPLAIN (ANALYZE, COSTS off, TIMING off, IO)
SELECT * FROM t WHERE a < 100000;

QUERY PLAN
-------------------------------------------------
Seq Scan on t (actual rows=999996.00 loops=1)
Filter: (a < 100000)
Rows Removed by Filter: 4
Prefetch: avg=262.629 max=271 capacity=272
I/O: stalls=653 size=15.983 inprogress=15.934
Buffers: shared read=55556
Planning:
Buffers: shared hit=50 read=23
Planning Time: 7.358 ms
Execution Time: 358.214 ms
(10 rows)
--------------------

The first line "Prefetch" tracks the look-ahead distance, i.e. how many
blocks ahead the ReadStream is requesting.

The second line "I/O" is about the I/O requests actually issued - how
many times we had to wait for the block (when we get to process it),
average size of a request (in BLCKSZ blocks), and average number of
in-progress requests.

The above example shows we've been reading ~262 blocks ahead average,
with theoretical maximum of 272 (and at some point we read 271 blocks).
The average is pretty close to the maximum, so we've been reading ~2MB
ahead, which seems pretty good.

The I/O stats says we've been reading in 16-block requests, so 128kB
chunks (which aligns with io_combine_limit=16). And when issuing a new
I/O, there were ~16 requests in-progress already (which aligns with the
maximum distance, because 16*17=272).

We could track much more information. The WIP patches tracked histograms
of various metrics (distances, sizes, ...), and it was quite handy
during development. For regular EXPLAIN we chose to select only the most
useful subset, to make it meaningful and also cheap to collect. And also
generic enough to work for other table AM implementations.

Regarding the basic design, and a couple topics for review questions:

1) The information is collected by ReadStream, unconditionally.

I experimented with having a flag to "request" collecting these stats
(so that it can be done only for EXPLAIN ANALYZE), but that turned out
pretty useless. The stats are super cheap, so the extra flag did not
make a meaningful difference. And passing the flag to the scan/stream
was way too invasive. So all ReadStreams track it.

There's a ReadStreamInstrumentation defined in instrumet_node.h, but I'm
not convinced that's the right place. Maybe it should be defined in
read_stream.h instead.

2) This is inherently a TAM implementation detail.

The ReadStream is "inside" the TAM, and some TAMs may not even use a
ReadStream to do I/O. So EXPLAIN can't just inspect the stream directly,
that'd violate the layering. It needs to do it through the TAM API.

So this introduces a new TAM callback "scan_stats" which gets a scan
descriptor, and extracts stats from the stream (if any). The index
prefetching will need a second callback for IndexScanDesc.

It also introduces a "generic" TableScanStatsData struct, so that it
does not work with ReadStreamInstrumentation (because what if the TAM
does not use a ReadStream). Of course, for heap AM it's 1:1.

For heap AM the callback is pretty simple, it just collects the read
stream stats and "translates" it to the TableScanStats instance.

3) Adds shared instrumentation to SeqScan workers.

The patch also had to improve a parallel SeqScan to collect per-worker
instrumentation, similarly to a BitmapHeapScan. Until now this was not
needed, but it's needed for tracking per-worker prefetch stats.

Here's a couple questions I'm asking myself about this patch:

- Is the selected information really the minimal meaningful set of stats
we could track? Is it sufficiently clear what we're tracking, and/or
should we track something else?

- Is the TableScanStatsData generic enough? Can other TAMs (not using
ReadStream) use this to show meaningful stats?

- If the separation between TAM and the low-level instrumentation clear
enough? Or is the ReadStreamInstrumentation "leaking" somewhere? For
example, is it OK it's in SeqScanInstrumentation?

But I'm sure there are other questions I haven't thought of.

regards

--
Tomas Vondra

Lukas Fittl

lukas@fittl.com

22 days ago

In reply to: Tomas Vondra (#1)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi Tomas,

On Sun, Mar 15, 2026 at 12:49 PM Tomas Vondra <tomas@vondra.me> wrote:

Here's a patch adding prefetch/read-ahead info about IO to EXPLAIN. This
was initially developed as part of the index prefetch patches, but it
became clear it can be useful for other nodes using ReadStream. That
includes seqscan and bitmap heap scans, and (hopefully) will include
index scan in PG19.

I think surfacing this to users makes a lot of sense - read streams
are hard to understand in detail, and this will make it easier to tie
back the I/O performance of a query to what happened in terms of
pre-fetching and stalls.

Big +1 on the concept of making this visible.

In EXPLAIN, this new info is enabled by a new option "IO", which adds
two new lines to the output:

I'm 50/50 if hiding this behind a new option really makes sense - if
its cheap enough to always capture, why not always show it?

e.g. we could consider doing with this what we did with BUFFERS
recently, which is to enable it by default. If someone finds that too
visually busy, they could still do IO OFF.

This also does make me wonder a bit what we should do about I/O
timings. Conceptually they'd belong closer to IO now than BUFFERS..

--------------------
EXPLAIN (ANALYZE, COSTS off, TIMING off, IO)
SELECT * FROM t WHERE a < 100000;

QUERY PLAN
-------------------------------------------------
Seq Scan on t (actual rows=999996.00 loops=1)
Filter: (a < 100000)
Rows Removed by Filter: 4
Prefetch: avg=262.629 max=271 capacity=272
I/O: stalls=653 size=15.983 inprogress=15.934
Buffers: shared read=55556
Planning:
Buffers: shared hit=50 read=23
Planning Time: 7.358 ms
Execution Time: 358.214 ms
(10 rows)
--------------------

The first line "Prefetch" tracks the look-ahead distance, i.e. how many
blocks ahead the ReadStream is requesting.

The second line "I/O" is about the I/O requests actually issued - how
many times we had to wait for the block (when we get to process it),
average size of a request (in BLCKSZ blocks), and average number of
in-progress requests.

I wonder if we could somehow consolidate this into one line for the
text format? (specifically, moving prefetch into "I/O" at the end?)

I'm also not sure if "max" is really that useful, vs capacity?

1) The information is collected by ReadStream, unconditionally.

I experimented with having a flag to "request" collecting these stats
(so that it can be done only for EXPLAIN ANALYZE), but that turned out
pretty useless. The stats are super cheap, so the extra flag did not
make a meaningful difference. And passing the flag to the scan/stream
was way too invasive. So all ReadStreams track it.

Makes sense. I did some testing with your patch against master, and
couldn't find any meaningful difference - its logical that this would
be cheap enough to always do.

2) This is inherently a TAM implementation detail.

...

3) Adds shared instrumentation to SeqScan workers.

The patch also had to improve a parallel SeqScan to collect per-worker
instrumentation, similarly to a BitmapHeapScan. Until now this was not
needed, but it's needed for tracking per-worker prefetch stats.

I feel like something is off about the complexity of having each node
type ferry back the information. e.g. when you're implementing the
support for index prefetching, it'll require a bunch more changes. In
my mind, there is a reason we have a related problem that we solved
with the current pgBufferUsage, instead of dealing with that on a
per-node basis. I really feel we should have a more generic way of
dealing with this.

I'm saying that not being completely unbiased, because I think this
would be a great fit for the stack-based instrumentation I've been
discussing with Andres and Zsolt over at [0]/messages/by-id/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com. Andres at least
expressed some potential interest in getting that into 19, though its
definitely not a trivial patch set.

If we were to get stack-based instrumentation in, we could easily add
a new "IOUsage" to the Instrumentation struct, avoid any modification
to the TAM interface, and align it with how we treat BufferUsage and
WALUsage.

I've attached a prototype of how that could look like (apply the other
patch set first, v8, see commit fest entry [1]https://commitfest.postgresql.org/patch/6023/ - also attached a
preparatory refactoring of using "Instrumentation" for parallel query
reporting, which avoids having individual structs there). Performance
is actually better for "EXPLAIN (ANALYZE, IO, BUFFERS OFF, TIMING
OFF)" with that patch set on a quick COUNT(*) test, but that's mostly
due to the overhead of ExecProcNode dispatching that I fixed in the
other patch set (v8/0006). I think otherwise this would be similar in
performance thanks to the stack avoiding any "accum diff" logic.

Its also worth noting that this would make it trivial to output this
information for utility commands that have read stream support, or
show aggregate statistics in pg_stat_statements/etc.

Let me know if that's at all of interest and happy to pair up on this
to make it work.

Thanks,
Lukas

[0]: /messages/by-id/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
[1]: https://commitfest.postgresql.org/patch/6023/

--
Lukas Fittl

Melanie Plageman

melanieplageman@gmail.com

22 days ago

In reply to: Tomas Vondra (#1)

Re: EXPLAIN: showing ReadStream / prefetch stats

On Sun, Mar 15, 2026 at 3:49 PM Tomas Vondra <tomas@vondra.me> wrote:

- If the separation between TAM and the low-level instrumentation clear
enough? Or is the ReadStreamInstrumentation "leaking" somewhere? For
example, is it OK it's in SeqScanInstrumentation?

Personally, I don't like having both structs
(ReadStreamInstrumentation and TableScanStatsData).

The executor nodes (SeqScanState, BitmapHeapScanState) already embed
ReadStreamInstrumentation directly in their instrumentation structs,
so we already have a reference to the read stream in table AM-agnostic
code. Having a second identical struct means maintaining two
definitions without any actual benefit.

But I'm sure there are other questions I haven't thought of.

I see a couple more issues with the counting in read_stream.c.

You are double-counting stalls for synchronous IO. You increment
stalls in read_stream_next_buffer() but we actually execute
synchronous IO in WaitReadBuffers and return needed_wait as true,
which will count a stall again.

You are not counting fast path IOs because those don't go through
read_stream_start_pending_read() and instead are started directly by
StartReadBuffer() in read_stream_next_buffer(). Simple diff attached
should fix this.

Also, per worker stats are not displayed when BUFFERS false is passed
even with IO true because of a small oversight. I fixed it in the
attached diff.

- Melanie

Melanie Plageman

melanieplageman@gmail.com

22 days ago

In reply to: Lukas Fittl (#2)

Re: EXPLAIN: showing ReadStream / prefetch stats

On Mon, Mar 16, 2026 at 3:08 AM Lukas Fittl <lukas@fittl.com> wrote:

I'm 50/50 if hiding this behind a new option really makes sense - if
its cheap enough to always capture, why not always show it?

e.g. we could consider doing with this what we did with BUFFERS
recently, which is to enable it by default. If someone finds that too
visually busy, they could still do IO OFF.

I like this idea.

This also does make me wonder a bit what we should do about I/O
timings. Conceptually they'd belong closer to IO now than BUFFERS..

I could see moving the timings to the I/O line.

The second line "I/O" is about the I/O requests actually issued - how
many times we had to wait for the block (when we get to process it),
average size of a request (in BLCKSZ blocks), and average number of
in-progress requests.

I wonder if we could somehow consolidate this into one line for the
text format? (specifically, moving prefetch into "I/O" at the end?)

Next release, I'm hopeful we'll get write combining in and then want
to include write IO details here. Not a reason to avoid making it one
line now, though. However, I think it will be a very long line...

I'm also not sure if "max" is really that useful, vs capacity?

I find it very helpful. If you keep increasing
effective_io_concurrency and io_combine_limit and don't see the max
increasing, I think it is helpful to know what the configured possible
limit would be vs what the read stream actually got you up to. And
because you can set effective_io_concurrency and io_combine_limit per
query, it is nice to know what this value was at the time the query
was run.

I feel like something is off about the complexity of having each node
type ferry back the information. e.g. when you're implementing the
support for index prefetching, it'll require a bunch more changes. In
my mind, there is a reason we have a related problem that we solved
with the current pgBufferUsage, instead of dealing with that on a
per-node basis. I really feel we should have a more generic way of
dealing with this.

<--snip-->

I've attached a prototype of how that could look like (apply the other
patch set first, v8, see commit fest entry [1] - also attached a
preparatory refactoring of using "Instrumentation" for parallel query
reporting, which avoids having individual structs there).

The patch footprint is _much_ nicer with your stack-based
instrumentation. Very cool. I'll leave it to Tomas whether he wants to
create a dependency on your big project a few weeks before feature
freeze, though.

Its also worth noting that this would make it trivial to output this
information for utility commands that have read stream support, or
show aggregate statistics in pg_stat_statements/etc.

Yes this would be a major bonus. Even for currently possible users,
this hasn't been extended to TID Range Scan -- which involves more LOC
and boilerplate.

- Melanie

Tomas Vondra

tomas.vondra@2ndquadrant.com

22 days ago

In reply to: Melanie Plageman (#3)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/16/26 18:14, Melanie Plageman wrote:

On Sun, Mar 15, 2026 at 3:49 PM Tomas Vondra <tomas@vondra.me> wrote:

- If the separation between TAM and the low-level instrumentation clear
enough? Or is the ReadStreamInstrumentation "leaking" somewhere? For
example, is it OK it's in SeqScanInstrumentation?

Personally, I don't like having both structs
(ReadStreamInstrumentation and TableScanStatsData).

The executor nodes (SeqScanState, BitmapHeapScanState) already embed
ReadStreamInstrumentation directly in their instrumentation structs,
so we already have a reference to the read stream in table AM-agnostic
code. Having a second identical struct means maintaining two
definitions without any actual benefit.

That's a good point. The executor nodes should not embed anything
readstream-specific, not even in the instrumentation part. AFAICS the
cleanest way would be to replace this with the TableScanStatsData.

However, ReadStreamInstrumentation is not really ReadStream specific,
except for the name. I'm actually wondering if we should have just one
struct, that'd be enough.

But I'm sure there are other questions I haven't thought of.

I see a couple more issues with the counting in read_stream.c.

You are double-counting stalls for synchronous IO. You increment
stalls in read_stream_next_buffer() but we actually execute
synchronous IO in WaitReadBuffers and return needed_wait as true,
which will count a stall again.

You are not counting fast path IOs because those don't go through
read_stream_start_pending_read() and instead are started directly by
StartReadBuffer() in read_stream_next_buffer(). Simple diff attached
should fix this.

Also, per worker stats are not displayed when BUFFERS false is passed
even with IO true because of a small oversight. I fixed it in the
attached diff.

Thanks!

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

22 days ago

In reply to: Melanie Plageman (#4)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/16/26 18:29, Melanie Plageman wrote:

On Mon, Mar 16, 2026 at 3:08 AM Lukas Fittl <lukas@fittl.com> wrote:

I'm 50/50 if hiding this behind a new option really makes sense - if
its cheap enough to always capture, why not always show it?

e.g. we could consider doing with this what we did with BUFFERS
recently, which is to enable it by default. If someone finds that too
visually busy, they could still do IO OFF.

I like this idea.

Maybe. I think it's important not to overwhelm the users with too much
information, and so my intuition was to keep it OFF by default. But if
the agreement is that it should be opt-out, I can live with that.

This also does make me wonder a bit what we should do about I/O
timings. Conceptually they'd belong closer to IO now than BUFFERS..

I could see moving the timings to the I/O line.

Moving the I/O timings to the "IO" line seems reasonable to me.

The second line "I/O" is about the I/O requests actually issued - how
many times we had to wait for the block (when we get to process it),
average size of a request (in BLCKSZ blocks), and average number of
in-progress requests.

I wonder if we could somehow consolidate this into one line for the
text format? (specifically, moving prefetch into "I/O" at the end?)

Next release, I'm hopeful we'll get write combining in and then want
to include write IO details here. Not a reason to avoid making it one
line now, though. However, I think it will be a very long line...

Let's not make the lines too long. The distance information is not
really about I/O, it's about an internal queue / heuristics.

FWIW I assume we'd want to track some of the metrics (average IO size,
number of IOs, ...) separate for reads and writes, so I'm not sure those
will go to the same line too.

I'm also not sure if "max" is really that useful, vs capacity?

I find it very helpful. If you keep increasing
effective_io_concurrency and io_combine_limit and don't see the max
increasing, I think it is helpful to know what the configured possible
limit would be vs what the read stream actually got you up to. And
because you can set effective_io_concurrency and io_combine_limit per
query, it is nice to know what this value was at the time the query
was run.

I'm not sure. Shouldn't the average distance give us about the same
thing? Sure, it'll never be exactly the maximum, but if the prefetch
behavior is reasonably stable, it should not be too far.

It'll be different if the prefetch distance varies a lot (e.g. because
parts of the table have vastly different cache hit ratios). But then
it's hard to interpret the "max" too, I think.

So maybe just average + capacity might suffice?

I feel like something is off about the complexity of having each node
type ferry back the information. e.g. when you're implementing the
support for index prefetching, it'll require a bunch more changes. In
my mind, there is a reason we have a related problem that we solved
with the current pgBufferUsage, instead of dealing with that on a
per-node basis. I really feel we should have a more generic way of
dealing with this.

<--snip-->

I've attached a prototype of how that could look like (apply the other
patch set first, v8, see commit fest entry [1] - also attached a
preparatory refactoring of using "Instrumentation" for parallel query
reporting, which avoids having individual structs there).

The patch footprint is _much_ nicer with your stack-based
instrumentation. Very cool. I'll leave it to Tomas whether he wants to
create a dependency on your big project a few weeks before feature
freeze, though.

I don't know. I have not looked at the stack-based instrumentation yet,
so I have no idea how complex or how close to committable it is.

Its also worth noting that this would make it trivial to output this
information for utility commands that have read stream support, or
show aggregate statistics in pg_stat_statements/etc.

Yes this would be a major bonus. Even for currently possible users,
this hasn't been extended to TID Range Scan -- which involves more LOC
and boilerplate.

Oh, I forgot about the TID scan ...

I agree doing this in a way that does not require changes to TAM
interface would be nice.

regards

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

21 days ago

In reply to: Tomas Vondra (#6)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/16/26 19:26, Tomas Vondra wrote:

On 3/16/26 18:29, Melanie Plageman wrote:

On Mon, Mar 16, 2026 at 3:08 AM Lukas Fittl <lukas@fittl.com> wrote:

...

I feel like something is off about the complexity of having each node
type ferry back the information. e.g. when you're implementing the
support for index prefetching, it'll require a bunch more changes. In
my mind, there is a reason we have a related problem that we solved
with the current pgBufferUsage, instead of dealing with that on a
per-node basis. I really feel we should have a more generic way of
dealing with this.

<--snip-->

I've attached a prototype of how that could look like (apply the other
patch set first, v8, see commit fest entry [1] - also attached a
preparatory refactoring of using "Instrumentation" for parallel query
reporting, which avoids having individual structs there).

It looks reasonable, and it's nice to not have to worry about passing
the data through the TAM interface etc. Having to do stuff like this:

-	if (instr->instr.need_bufusage || instr->instr.need_walusage)
+	if (instr->instr.need_bufusage || instr->instr.need_walusage ||
instr->instr.need_iousage)

in so many places is not great. It'd really benefit from some macro that
says "if any of the pieces is needed ...".

The patch footprint is _much_ nicer with your stack-based
instrumentation. Very cool. I'll leave it to Tomas whether he wants to
create a dependency on your big project a few weeks before feature
freeze, though.

I don't know. I have not looked at the stack-based instrumentation yet,
so I have no idea how complex or how close to committable it is.

I've started looking at the stack tracking patch only today, and it
looks in pretty good shape. I have no clear idea how close to being
committed it is, so I'm hesitant to make this patch depend on it.

AFAICS the stack-based tracking is meant for resources with just one
instance. E.g. there's only one WAL, or one buffer pool, so when a node
says "update WAL usage" or "increment buffer hit/read", there's one
counter to update. But will that necessarily apply for read streams?

AFAIK existing scans use only a single read stream, but that's only
because none of the built-in index AMs uses a read stream. If we get
index prefetching in PG19, or if any index AM (e.g. pgvector) chooses to
use a read stream internally, how will that work? Will we "merge" the
stats from all read streams used by the node, or what will happen?

I realize this is a similar issue the 0007 part of the stack-based
tracing patch deals with - tracking buffers for index/table separately.
And it does that by making nodeIndexscan.c quite a bit more complex by
expanding index_getnext_slot(). Doesn't seem great, but not sure. I
suppose nodeIndexonlyscan.c will have to do something similar.

regards

--
Tomas Vondra

Tomas Vondra

tomas.vondra@2ndquadrant.com

21 days ago

In reply to: Tomas Vondra (#5)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/16/26 18:49, Tomas Vondra wrote:

On 3/16/26 18:14, Melanie Plageman wrote:

...

I see a couple more issues with the counting in read_stream.c.

You are double-counting stalls for synchronous IO. You increment
stalls in read_stream_next_buffer() but we actually execute
synchronous IO in WaitReadBuffers and return needed_wait as true,
which will count a stall again.

You are not counting fast path IOs because those don't go through
read_stream_start_pending_read() and instead are started directly by
StartReadBuffer() in read_stream_next_buffer(). Simple diff attached
should fix this.

Also, per worker stats are not displayed when BUFFERS false is passed
even with IO true because of a small oversight. I fixed it in the
attached diff.

Thanks!

Here's a v2 with your fixes merged. No other changes.

regards

--
Tomas Vondra

Andres Freund

andres@anarazel.de

21 days ago

In reply to: Tomas Vondra (#8)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi,

On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:

Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
VERBOSE)

This adds details about AIO / prefetch for a number of executor nodes
using the ReadStream, notably:

- SeqScan
- BitmapHeapScan

The statistics is tracked by the ReadStream, and then propagated up
through the table AM interface.

The ReadStream tracks the statistics unconditionally, i.e. even outside
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

The TAM gets one new callback "scan_stats", to collect stats from the
scan (which fetch tuples from the TAM). There is also a new struct
TableScanStatsData/TableScanStats to separate the statistics from the
actual TAM implementation.

It's not really clear to me that this need to go through the tableam
interface. All the accesses already happen within AM code, it seems we could
just populate fields in TableScanDesc TableScanDesc->rs_flags indicates that
that is desired.

That seems like it might be nicer, because

a) there could be multiple read stream instances within a scan
b) we will need one such callback for each different type of scan - so we
could just as well do that within the scan datastructure

Showing any such detail in explain will always be a bit of a piercing of
abstraction layers, I'm not particularly concerned with having optionally
fillable information in *ScanDesc that some AM might not want in that
shape. It seems unlikely that a lot of other architectural decisions would be
based on that, so we can just evolve this in the future.

+		case T_SeqScan:
+			{
+				SharedSeqScanInstrumentation *sinstrument
+					= ((SeqScanState *) planstate)->sinstrument;
+
+				/* get the sum of the counters set within each and every process */
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						SeqScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						stats.prefetch_count += winstrument->stream.prefetch_count;
+						stats.distance_sum += winstrument->stream.distance_sum;
+						if (winstrument->stream.distance_max > stats.distance_max)
+							stats.distance_max = winstrument->stream.distance_max;
+						if (winstrument->stream.distance_capacity > stats.distance_capacity)
+							stats.distance_capacity = winstrument->stream.distance_capacity;
+						stats.stall_count += winstrument->stream.stall_count;
+						stats.io_count += winstrument->stream.io_count;
+						stats.io_nblocks += winstrument->stream.io_nblocks;
+						stats.io_in_progress += winstrument->stream.io_in_progress;
+					}
+				}
+
+				break;
+			}
+		case T_BitmapHeapScan:
+			{
+				SharedBitmapHeapInstrumentation *sinstrument
+					= ((BitmapHeapScanState *) planstate)->sinstrument;
+
+				/* get the sum of the counters set within each and every process */
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						BitmapHeapScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						stats.prefetch_count += winstrument->stream.prefetch_count;
+						stats.distance_sum += winstrument->stream.distance_sum;
+						if (winstrument->stream.distance_max > stats.distance_max)
+							stats.distance_max = winstrument->stream.distance_max;
+						if (winstrument->stream.distance_capacity > stats.distance_capacity)
+							stats.distance_capacity = winstrument->stream.distance_capacity;
+						stats.stall_count += winstrument->stream.stall_count;
+						stats.io_count += winstrument->stream.io_count;
+						stats.io_nblocks += winstrument->stream.io_nblocks;
+						stats.io_in_progress += winstrument->stream.io_in_progress;

It doesn't seem good to duplicate these fairly large blocks. Can't that be in
a helper?

+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats.distance_sum * 1.0 / stats.prefetch_count),
+							 stats.distance_max,
+							 stats.distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats.stall_count > 0 || stats.io_count > 0)
+			{

Cann there be stalls without IO? Why check both?

+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats.stall_count);
+
+				if (stats.io_count > 0)
+				{
+					appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+									 (stats.io_nblocks * 1.0 / stats.io_count),
+									 (stats.io_in_progress * 1.0 / stats.io_count));
+				}
+
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainOpenGroup("Prefetch", "I/O", true, es);
+
+			ExplainPropertyFloat("Average Distance", NULL,
+								 (stats.distance_sum * 1.0 / stats.prefetch_count), 3, es);
+			ExplainPropertyInteger("Max Distance", NULL,
+								   stats.distance_max, es);
+			ExplainPropertyInteger("Capacity", NULL,
+								   stats.distance_capacity, es);
+			ExplainPropertyUInteger("Stalls", NULL,
+									stats.stall_count, es);
+
+			if (stats.io_count > 0)
+			{
+				ExplainPropertyFloat("Average IO Size", NULL,
+									 (stats.io_nblocks * 1.0 / stats.io_count), 3, es);
+				ExplainPropertyFloat("Average IOs In Progress", NULL,
+									 (stats.io_in_progress * 1.0 / stats.io_count), 3, es);
+			}
+
+			ExplainCloseGroup("Prefetch", "I/O", true, es);
+		}
+	}
+}
+
+/*
+ * show_io_worker_info
+ *		show info about prefetching for a single worker
+ *
+ * Shows prefetching stats for a parallel scan worker.
+ */
+static void
+show_worker_io_info(PlanState *planstate, ExplainState *es, int worker)
+{
+	Plan	   *plan = planstate->plan;
+	ReadStreamInstrumentation *stats = NULL;
+
+	if (!es->io)
+		return;
+
+	/* get instrumentation for the given worker */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapHeapScan:
+			{
+				BitmapHeapScanState *state = ((BitmapHeapScanState *) planstate);
+				SharedBitmapHeapInstrumentation *sinstrument = state->sinstrument;
+				BitmapHeapScanInstrumentation *instrument = &sinstrument->sinstrument[worker];
+
+				stats = &instrument->stream;
+
+				break;
+			}
+		case T_SeqScan:
+			{
+				SeqScanState *state = ((SeqScanState *) planstate);
+				SharedSeqScanInstrumentation *sinstrument = state->sinstrument;
+				SeqScanInstrumentation *instrument = &sinstrument->sinstrument[worker];
+
+				stats = &instrument->stream;
+
+				break;
+			}
+		default:
+			/* ignore other plans */
+			return;
+	}
+
+	/* don't print stats if there's nothing to report */
+	if (stats->prefetch_count > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats->distance_sum * 1.0 / stats->prefetch_count),
+							 stats->distance_max,
+							 stats->distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats->stall_count > 0 || stats->io_count > 0)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats->stall_count);
+
+				if (stats->io_count > 0)
+				{
+					appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+									 (stats->io_nblocks * 1.0 / stats->io_count),
+									 (stats->io_in_progress * 1.0 / stats->io_count));
+				}
+
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainOpenGroup("Prefetch", "I/O", true, es);
+
+			ExplainPropertyFloat("Average Distance", NULL,
+								 (stats->distance_sum * 1.0 / stats->prefetch_count), 3, es);
+			ExplainPropertyInteger("Max Distance", NULL,
+								   stats->distance_max, es);
+			ExplainPropertyInteger("Capacity", NULL,
+								   stats->distance_capacity, es);
+			ExplainPropertyUInteger("Stalls", NULL,
+									stats->stall_count, es);
+
+			if (stats->io_count > 0)
+			{
+				ExplainPropertyFloat("Average IO Size", NULL,
+									 (stats->io_nblocks * 1.0 / stats->io_count), 3, es);
+				ExplainPropertyFloat("Average IOs In Progress", NULL,
+									 (stats->io_in_progress * 1.0 / stats->io_count), 3, es);
+			}
+
+			ExplainCloseGroup("Prefetch", "I/O", true, es);
+		}
+	}
+}

Lots of duplication between show_scan_io_info() and show_worker_io_info(). I
assume that's just due to prototype state?

+/*
+ * read_stream_update_stats_prefetch
+ *		update read_stream stats with current pinned buffer depth

Other stuff in read stream does not do this absurd thing of restating the
function name in a comment and indenting the "title" of the function. Don't do
it here.

+ *
+ * Called once per buffer returned to the consumer in read_stream_next_buffer().
+ * Records the number of pinned buffers at that moment, so we can compute the
+ * average look-ahead depth.
+ */
+static inline void
+read_stream_update_stats_prefetch(ReadStream *stream)
+{
+	stream->stats.prefetch_count++;
+	stream->stats.distance_sum += stream->pinned_buffers;
+	if (stream->pinned_buffers > stream->stats.distance_max)
+		stream->stats.distance_max = stream->pinned_buffers;
+}

+/*
+ * read_stream_update_stats_io
+ *		update read_stream stats about size of I/O requests
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	stream->stats.io_count++;
+	stream->stats.io_nblocks += nblocks;
+	stream->stats.io_in_progress += in_progress;
+}

What's the point of the caller passing in stream->ios_in_progress? Are you
expecting some callers to pass in different values? If so, why?

Seems a bit confusing to have both stream->stats.io_in_progress and
streams->ios_in_progress. It's also not clear that stats.io_in_progress is a
sum from the name.

@@ -851,6 +894,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
flags)))
{
/* Fast return. */
+				read_stream_update_stats_prefetch(stream);
return buffer;
}

@@ -860,6 +904,9 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
stream->ios_in_progress = 1;
stream->ios[0].buffer_index = oldest_buffer_index;
stream->seq_blocknum = next_blocknum + 1;
+
+			/* update I/O stats */
+			read_stream_update_stats_io(stream, 1, stream->ios_in_progress);
}
else
{
@@ -871,6 +918,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
}

stream->fast_path = false;
+ read_stream_update_stats_prefetch(stream);
return buffer;
}
#endif

Should read_stream_update_stats_prefetch() actually be called when we reached
the end of the stream?

@@ -916,12 +964,17 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
{
int16 io_index = stream->oldest_io_index;
int32 distance; /* wider temporary value, clamped below */
+ bool needed_wait;

/* Sanity check that we still agree on the buffers. */
Assert(stream->ios[io_index].op.buffers ==
&stream->buffers[oldest_buffer_index]);
-		WaitReadBuffers(&stream->ios[io_index].op);
+		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+
+		/* Count it as a stall if we need to wait for IO */
+		if (needed_wait)
+			stream->stats.stall_count += 1;
Assert(stream->ios_in_progress > 0);
stream->ios_in_progress--;

I'd probably put the stalls_count++ in a helper too, that makes it easier to
test compiling out the stats support and similar things.

@@ -328,6 +329,21 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
si->exact_pages += node->stats.exact_pages;
si->lossy_pages += node->stats.lossy_pages;
+
+		/* collect prefetch info for this process from the read_stream */
+		if ((stats = table_scan_stats(node->ss.ss_currentScanDesc)) != NULL)
+		{
+			si->stream.prefetch_count += stats->prefetch_count;
+			si->stream.distance_sum += stats->distance_sum;
+			if (stats->distance_max > si->stream.distance_max)
+				si->stream.distance_max = stats->distance_max;
+			if (stats->distance_capacity > si->stream.distance_capacity)
+				si->stream.distance_capacity = stats->distance_capacity;
+			si->stream.stall_count += stats->stall_count;
+			si->stream.io_count += stats->io_count;
+			si->stream.io_nblocks += stats->io_nblocks;
+			si->stream.io_in_progress += stats->io_in_progress;
+		}
}

Could this be wrapped in a helper function?

+/*
+ * Generic prefetch stats for table scans.
+ */
+typedef struct TableScanStatsData
+{
+	/* number of buffers returned to consumer (for averaging distance) */
+	uint64		prefetch_count;
+
+	/* sum of pinned_buffers sampled at each buffer return */
+	uint64		distance_sum;
+
+	/* maximum actual pinned_buffers observed during the scan */
+	int16		distance_max;
+
+	/* maximum possible look-ahead distance (max_pinned_buffers) */
+	int16		distance_capacity;
+
+	/* number of stalled reads (waiting for I/O) */
+	uint64		stall_count;
+
+	/* I/O stats */
+	uint64		io_count;		/* number of I/Os */
+	uint64		io_nblocks;		/* sum of blocks for all I/Os */
+	uint64		io_in_progress;	/* sum of in-progress I/Os */
+} TableScanStatsData;
+typedef struct TableScanStatsData *TableScanStats;

--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -41,9 +41,51 @@ typedef struct SharedAggInfo

/* ---------------------
- *	Instrumentation information for indexscans (amgettuple and amgetbitmap)
+ *	Instrumentation information about read streams
* ---------------------
*/
+typedef struct ReadStreamInstrumentation
+{
+	/* number of buffers returned to consumer (for averaging distance) */
+	uint64		prefetch_count;
+
+	/* sum of pinned_buffers sampled at each buffer return */
+	uint64		distance_sum;
+
+	/* maximum actual pinned_buffers observed during the scan */
+	int16		distance_max;
+
+	/* maximum possible look-ahead distance (max_pinned_buffers) */
+	int16		distance_capacity;
+
+	/* number of stalled reads (waiting for I/O) */
+	uint64		stall_count;
+
+	/* I/O stats */
+	uint64		io_count;		/* number of I/Os */
+	uint64		io_nblocks;		/* sum of blocks for all I/Os */
+	uint64		io_in_progress;	/* sum of in-progress I/Os */
+} ReadStreamInstrumentation;

I don't get what we gain by having this twice, once as TableScanStatsData and
once as ReadStreamInstrumentation.

Greetings,

Andres Freund

#10

Tomas Vondra

tomas.vondra@2ndquadrant.com

20 days ago

In reply to: Andres Freund (#9)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/17/26 19:40, Andres Freund wrote:

Hi,

On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:

Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
VERBOSE)

This adds details about AIO / prefetch for a number of executor nodes
using the ReadStream, notably:

- SeqScan
- BitmapHeapScan

The statistics is tracked by the ReadStream, and then propagated up
through the table AM interface.

The ReadStream tracks the statistics unconditionally, i.e. even outside
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

The TAM gets one new callback "scan_stats", to collect stats from the
scan (which fetch tuples from the TAM). There is also a new struct
TableScanStatsData/TableScanStats to separate the statistics from the
actual TAM implementation.

It's not really clear to me that this need to go through the tableam
interface. All the accesses already happen within AM code, it seems we could
just populate fields in TableScanDesc TableScanDesc->rs_flags indicates that
that is desired.

Interesting idea to pass this data through the scan descriptor ...

That seems like it might be nicer, because

a) there could be multiple read stream instances within a scan
b) we will need one such callback for each different type of scan - so we
could just as well do that within the scan datastructure

AFAICS (a) is very similar to the argument I made regarding the
stack-based instrumentation earlier, as it wasn't clear to me how would
that work for scans with multiple streams. It didn't occur to me it's an
issue for the current solution with TAM callbacks too.

However, does passing the info through the scan descriptor address this?
I don't see how could it, if there's just a single field per scan.

Also, what do you mean by "one callback for each different type of
scan"? Right now we need two callbacks - one for TableScanDesc and one
for IndexScanDesc. Do we expect to get more?

I don't see much difference between an optional TAM callback for each
scan type, and an optional field in each scan descriptor.

We can base this on rs_flags, but wouldn't it be required in all
table_beginscan variants anyway? I have initially planned to pass a flag
indicating we're in EXPLAIN (ANALYZE), but rs_flags is not that. It's
hard-coded in tableam.

+		case T_SeqScan:
+			{
+				SharedSeqScanInstrumentation *sinstrument
+					= ((SeqScanState *) planstate)->sinstrument;
+
+				/* get the sum of the counters set within each and every process */
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						SeqScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						stats.prefetch_count += winstrument->stream.prefetch_count;
+						stats.distance_sum += winstrument->stream.distance_sum;
+						if (winstrument->stream.distance_max > stats.distance_max)
+							stats.distance_max = winstrument->stream.distance_max;
+						if (winstrument->stream.distance_capacity > stats.distance_capacity)
+							stats.distance_capacity = winstrument->stream.distance_capacity;
+						stats.stall_count += winstrument->stream.stall_count;
+						stats.io_count += winstrument->stream.io_count;
+						stats.io_nblocks += winstrument->stream.io_nblocks;
+						stats.io_in_progress += winstrument->stream.io_in_progress;
+					}
+				}
+
+				break;
+			}
+		case T_BitmapHeapScan:
+			{
+				SharedBitmapHeapInstrumentation *sinstrument
+					= ((BitmapHeapScanState *) planstate)->sinstrument;
+
+				/* get the sum of the counters set within each and every process */
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						BitmapHeapScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						stats.prefetch_count += winstrument->stream.prefetch_count;
+						stats.distance_sum += winstrument->stream.distance_sum;
+						if (winstrument->stream.distance_max > stats.distance_max)
+							stats.distance_max = winstrument->stream.distance_max;
+						if (winstrument->stream.distance_capacity > stats.distance_capacity)
+							stats.distance_capacity = winstrument->stream.distance_capacity;
+						stats.stall_count += winstrument->stream.stall_count;
+						stats.io_count += winstrument->stream.io_count;
+						stats.io_nblocks += winstrument->stream.io_nblocks;
+						stats.io_in_progress += winstrument->stream.io_in_progress;

It doesn't seem good to duplicate these fairly large blocks. Can't that be in
a helper?

Yes, good point.

+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats.distance_sum * 1.0 / stats.prefetch_count),
+							 stats.distance_max,
+							 stats.distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats.stall_count > 0 || stats.io_count > 0)
+			{

Cann there be stalls without IO? Why check both?

I don't recall. I think I originally checked only (io_count > 0), but
then at some point added the other condition. I should have documented
the reason better, I'll check.

+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats.stall_count);
+
+				if (stats.io_count > 0)
+				{
+					appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+									 (stats.io_nblocks * 1.0 / stats.io_count),
+									 (stats.io_in_progress * 1.0 / stats.io_count));
+				}
+
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainOpenGroup("Prefetch", "I/O", true, es);
+
+			ExplainPropertyFloat("Average Distance", NULL,
+								 (stats.distance_sum * 1.0 / stats.prefetch_count), 3, es);
+			ExplainPropertyInteger("Max Distance", NULL,
+								   stats.distance_max, es);
+			ExplainPropertyInteger("Capacity", NULL,
+								   stats.distance_capacity, es);
+			ExplainPropertyUInteger("Stalls", NULL,
+									stats.stall_count, es);
+
+			if (stats.io_count > 0)
+			{
+				ExplainPropertyFloat("Average IO Size", NULL,
+									 (stats.io_nblocks * 1.0 / stats.io_count), 3, es);
+				ExplainPropertyFloat("Average IOs In Progress", NULL,
+									 (stats.io_in_progress * 1.0 / stats.io_count), 3, es);
+			}
+
+			ExplainCloseGroup("Prefetch", "I/O", true, es);
+		}
+	}
+}
+
+/*
+ * show_io_worker_info
+ *		show info about prefetching for a single worker
+ *
+ * Shows prefetching stats for a parallel scan worker.
+ */
+static void
+show_worker_io_info(PlanState *planstate, ExplainState *es, int worker)
+{
+	Plan	   *plan = planstate->plan;
+	ReadStreamInstrumentation *stats = NULL;
+
+	if (!es->io)
+		return;
+
+	/* get instrumentation for the given worker */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapHeapScan:
+			{
+				BitmapHeapScanState *state = ((BitmapHeapScanState *) planstate);
+				SharedBitmapHeapInstrumentation *sinstrument = state->sinstrument;
+				BitmapHeapScanInstrumentation *instrument = &sinstrument->sinstrument[worker];
+
+				stats = &instrument->stream;
+
+				break;
+			}
+		case T_SeqScan:
+			{
+				SeqScanState *state = ((SeqScanState *) planstate);
+				SharedSeqScanInstrumentation *sinstrument = state->sinstrument;
+				SeqScanInstrumentation *instrument = &sinstrument->sinstrument[worker];
+
+				stats = &instrument->stream;
+
+				break;
+			}
+		default:
+			/* ignore other plans */
+			return;
+	}
+
+	/* don't print stats if there's nothing to report */
+	if (stats->prefetch_count > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats->distance_sum * 1.0 / stats->prefetch_count),
+							 stats->distance_max,
+							 stats->distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats->stall_count > 0 || stats->io_count > 0)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats->stall_count);
+
+				if (stats->io_count > 0)
+				{
+					appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+									 (stats->io_nblocks * 1.0 / stats->io_count),
+									 (stats->io_in_progress * 1.0 / stats->io_count));
+				}
+
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainOpenGroup("Prefetch", "I/O", true, es);
+
+			ExplainPropertyFloat("Average Distance", NULL,
+								 (stats->distance_sum * 1.0 / stats->prefetch_count), 3, es);
+			ExplainPropertyInteger("Max Distance", NULL,
+								   stats->distance_max, es);
+			ExplainPropertyInteger("Capacity", NULL,
+								   stats->distance_capacity, es);
+			ExplainPropertyUInteger("Stalls", NULL,
+									stats->stall_count, es);
+
+			if (stats->io_count > 0)
+			{
+				ExplainPropertyFloat("Average IO Size", NULL,
+									 (stats->io_nblocks * 1.0 / stats->io_count), 3, es);
+				ExplainPropertyFloat("Average IOs In Progress", NULL,
+									 (stats->io_in_progress * 1.0 / stats->io_count), 3, es);
+			}
+
+			ExplainCloseGroup("Prefetch", "I/O", true, es);
+		}
+	}
+}

Lots of duplication between show_scan_io_info() and show_worker_io_info(). I
assume that's just due to prototype state?

Yes, I think it needs helpers as mentioned earlier, etc.

+/*
+ * read_stream_update_stats_prefetch
+ *		update read_stream stats with current pinned buffer depth
Other stuff in read stream does not do this absurd thing of restating the
function name in a comment and indenting the "title" of the function. Don't do
it here.

+ *
+ * Called once per buffer returned to the consumer in read_stream_next_buffer().
+ * Records the number of pinned buffers at that moment, so we can compute the
+ * average look-ahead depth.
+ */
+static inline void
+read_stream_update_stats_prefetch(ReadStream *stream)
+{
+	stream->stats.prefetch_count++;
+	stream->stats.distance_sum += stream->pinned_buffers;
+	if (stream->pinned_buffers > stream->stats.distance_max)
+		stream->stats.distance_max = stream->pinned_buffers;
+}

+/*
+ * read_stream_update_stats_io
+ *		update read_stream stats about size of I/O requests
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	stream->stats.io_count++;
+	stream->stats.io_nblocks += nblocks;
+	stream->stats.io_in_progress += in_progress;
+}

What's the point of the caller passing in stream->ios_in_progress? Are you
expecting some callers to pass in different values? If so, why?

Are you suggesting we remove the argument and just use
stream->ios_in_progress directly? Yes, we can do do that. An earlier
version of the patch did pass (stream->ios_in_progress+1) from one
place, but after that was removed we don't need the argument.

Seems a bit confusing to have both stream->stats.io_in_progress and
streams->ios_in_progress. It's also not clear that stats.io_in_progress is a
sum from the name.

I don't think it's all that confusing - those fields are in different
structs. But I agree stream->stats.io_in_progress should indicate it's a
sum of values.

@@ -851,6 +894,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
flags)))
{
/* Fast return. */
+				read_stream_update_stats_prefetch(stream);
return buffer;
}

@@ -860,6 +904,9 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
stream->ios_in_progress = 1;
stream->ios[0].buffer_index = oldest_buffer_index;
stream->seq_blocknum = next_blocknum + 1;
+
+			/* update I/O stats */
+			read_stream_update_stats_io(stream, 1, stream->ios_in_progress);
}
else
{
@@ -871,6 +918,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
}

stream->fast_path = false;
+ read_stream_update_stats_prefetch(stream);
return buffer;
}
#endif

Should read_stream_update_stats_prefetch() actually be called when we reached
the end of the stream?

Hmmm, maybe we should not update the stats if

(next_blocknum == InvalidBlockNumber)

@@ -916,12 +964,17 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
{
int16 io_index = stream->oldest_io_index;
int32 distance; /* wider temporary value, clamped below */
+ bool needed_wait;

/* Sanity check that we still agree on the buffers. */
Assert(stream->ios[io_index].op.buffers ==
&stream->buffers[oldest_buffer_index]);
-		WaitReadBuffers(&stream->ios[io_index].op);
+		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+
+		/* Count it as a stall if we need to wait for IO */
+		if (needed_wait)
+			stream->stats.stall_count += 1;
Assert(stream->ios_in_progress > 0);
stream->ios_in_progress--;
I'd probably put the stalls_count++ in a helper too, that makes it easier to
test compiling out the stats support and similar things.

OK, makes sense

@@ -328,6 +329,21 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
si->exact_pages += node->stats.exact_pages;
si->lossy_pages += node->stats.lossy_pages;
+
+		/* collect prefetch info for this process from the read_stream */
+		if ((stats = table_scan_stats(node->ss.ss_currentScanDesc)) != NULL)
+		{
+			si->stream.prefetch_count += stats->prefetch_count;
+			si->stream.distance_sum += stats->distance_sum;
+			if (stats->distance_max > si->stream.distance_max)
+				si->stream.distance_max = stats->distance_max;
+			if (stats->distance_capacity > si->stream.distance_capacity)
+				si->stream.distance_capacity = stats->distance_capacity;
+			si->stream.stall_count += stats->stall_count;
+			si->stream.io_count += stats->io_count;
+			si->stream.io_nblocks += stats->io_nblocks;
+			si->stream.io_in_progress += stats->io_in_progress;
+		}
}

Could this be wrapped in a helper function?

Yes, similarly to the earlier comment.

+/*
+ * Generic prefetch stats for table scans.
+ */
+typedef struct TableScanStatsData
+{
+	/* number of buffers returned to consumer (for averaging distance) */
+	uint64		prefetch_count;
+
+	/* sum of pinned_buffers sampled at each buffer return */
+	uint64		distance_sum;
+
+	/* maximum actual pinned_buffers observed during the scan */
+	int16		distance_max;
+
+	/* maximum possible look-ahead distance (max_pinned_buffers) */
+	int16		distance_capacity;
+
+	/* number of stalled reads (waiting for I/O) */
+	uint64		stall_count;
+
+	/* I/O stats */
+	uint64		io_count;		/* number of I/Os */
+	uint64		io_nblocks;		/* sum of blocks for all I/Os */
+	uint64		io_in_progress;	/* sum of in-progress I/Os */
+} TableScanStatsData;
+typedef struct TableScanStatsData *TableScanStats;

--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -41,9 +41,51 @@ typedef struct SharedAggInfo

/* ---------------------
- *	Instrumentation information for indexscans (amgettuple and amgetbitmap)
+ *	Instrumentation information about read streams
* ---------------------
*/
+typedef struct ReadStreamInstrumentation
+{
+	/* number of buffers returned to consumer (for averaging distance) */
+	uint64		prefetch_count;
+
+	/* sum of pinned_buffers sampled at each buffer return */
+	uint64		distance_sum;
+
+	/* maximum actual pinned_buffers observed during the scan */
+	int16		distance_max;
+
+	/* maximum possible look-ahead distance (max_pinned_buffers) */
+	int16		distance_capacity;
+
+	/* number of stalled reads (waiting for I/O) */
+	uint64		stall_count;
+
+	/* I/O stats */
+	uint64		io_count;		/* number of I/Os */
+	uint64		io_nblocks;		/* sum of blocks for all I/Os */
+	uint64		io_in_progress;	/* sum of in-progress I/Os */
+} ReadStreamInstrumentation;

I don't get what we gain by having this twice, once as TableScanStatsData and
once as ReadStreamInstrumentation.

Nothing, really. I already suggested we should have just one struct
yesterday.

Thanks!

--
Tomas Vondra

#11

Andres Freund

andres@anarazel.de

20 days ago

In reply to: Tomas Vondra (#10)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi,

On 2026-03-17 20:32:44 +0100, Tomas Vondra wrote:

On 3/17/26 19:40, Andres Freund wrote:

On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:

Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
VERBOSE)

This adds details about AIO / prefetch for a number of executor nodes
using the ReadStream, notably:

- SeqScan
- BitmapHeapScan

The statistics is tracked by the ReadStream, and then propagated up
through the table AM interface.

The ReadStream tracks the statistics unconditionally, i.e. even outside
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

The TAM gets one new callback "scan_stats", to collect stats from the
scan (which fetch tuples from the TAM). There is also a new struct
TableScanStatsData/TableScanStats to separate the statistics from the
actual TAM implementation.

It's not really clear to me that this need to go through the tableam
interface. All the accesses already happen within AM code, it seems we could
just populate fields in TableScanDesc TableScanDesc->rs_flags indicates that
that is desired.

Interesting idea to pass this data through the scan descriptor ...

That seems like it might be nicer, because

a) there could be multiple read stream instances within a scan
b) we will need one such callback for each different type of scan - so we
could just as well do that within the scan datastructure

AFAICS (a) is very similar to the argument I made regarding the
stack-based instrumentation earlier, as it wasn't clear to me how would
that work for scans with multiple streams. It didn't occur to me it's an
issue for the current solution with TAM callbacks too.

However, does passing the info through the scan descriptor address this?
I don't see how could it, if there's just a single field per scan.

I'd expect the AM to merge the stats internally.

Also, what do you mean by "one callback for each different type of
scan"? Right now we need two callbacks - one for TableScanDesc and one
for IndexScanDesc.

Those two were what I was referring to. The only reason we don't need more is
that we already are combining different scan types (seq, bitmap, tid, sample)
via TableScanDesc, so we really already are quite strongly associating the
stats with the *ScanDesc structures.

Regardless of how we structure it, since we already combine stats for streams
in the parallel query stats, perhaps we should add a field to track for how
many streams the stats are?

Do we expect to get more?

I think we eventually might want to use read streams for things like getting
the source tuple to update/delete in a bulk update and for detoasting during
expression evaluation (which is surprisingly often IO bound). These would be
bound to an AM but currently we don't have a real descriptor (however, the
need for that has come up repeatedly, e.g. to be able to use multi_insert for
INSERT statements).

I also suspect that there are some executor nodes that could benefit from
doing streaming IO internally, e.g. Sort, Materialize, spilling Hash Agg, Hash
joins... But that'd not be abstracted behind an AM or such, so it's not
really orthogonal.

For things like bulk inserts it would be quite nice if we could use AIO for
reading in index pages with fewer synchronous waits by doing a loop over the
to-be-inserted entries and doing an index lookup that stops at the first miss
and start reading that in.

Before thinking about detoasting I had a bit of a hard time seeing a clear
advantage of using the stack based infra - thinking we always should have very
clear association between the executor node and the stream (in contrast to the
buffer stats which are collected without any possibility to make such an
association). However, if we wanted to e.g. track streams that are done as
part of expression evaluation - I am not at all sure whether we do - that'd
dissolve that clear association, and would be much easier to implement with
the stack based approach.

I don't see much difference between an optional TAM callback for each
scan type, and an optional field in each scan descriptor.

It's not entirely different, I agree.

The scan descriptor approach seems a tad nicer because it means that a) the
read stream can be terminated once it's not needed before we get to the stats
collection phase and b) it allows to combine the stats for multiple streams,
without the AM having to store the stats in the AM specific part of the
descriptor, in a field that is then just returned by the AM.

We can base this on rs_flags, but wouldn't it be required in all
table_beginscan variants anyway? I have initially planned to pass a flag
indicating we're in EXPLAIN (ANALYZE), but rs_flags is not that. It's
hard-coded in tableam.

I think Melanie has a patch that changes that :)

Until then you could just add a helper that ORs in the instrumentation flag.

+/*
+ * read_stream_update_stats_io
+ *		update read_stream stats about size of I/O requests
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	stream->stats.io_count++;
+	stream->stats.io_nblocks += nblocks;
+	stream->stats.io_in_progress += in_progress;
+}
What's the point of the caller passing in stream->ios_in_progress? Are you
expecting some callers to pass in different values? If so, why?
Are you suggesting we remove the argument and just use
stream->ios_in_progress directly?

Yes, that's what I am suggesting.

Greetings,

Andres Freund

#12

Tomas Vondra

tomas.vondra@2ndquadrant.com

20 days ago

In reply to: Andres Freund (#11)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/17/26 21:49, Andres Freund wrote:

Hi,

On 2026-03-17 20:32:44 +0100, Tomas Vondra wrote:

On 3/17/26 19:40, Andres Freund wrote:

On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:

Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
VERBOSE)

This adds details about AIO / prefetch for a number of executor nodes
using the ReadStream, notably:

- SeqScan
- BitmapHeapScan

The statistics is tracked by the ReadStream, and then propagated up
through the table AM interface.

The ReadStream tracks the statistics unconditionally, i.e. even outside
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

The TAM gets one new callback "scan_stats", to collect stats from the
scan (which fetch tuples from the TAM). There is also a new struct
TableScanStatsData/TableScanStats to separate the statistics from the
actual TAM implementation.

It's not really clear to me that this need to go through the tableam
interface. All the accesses already happen within AM code, it seems we could
just populate fields in TableScanDesc TableScanDesc->rs_flags indicates that
that is desired.

Interesting idea to pass this data through the scan descriptor ...

That seems like it might be nicer, because

a) there could be multiple read stream instances within a scan
b) we will need one such callback for each different type of scan - so we
could just as well do that within the scan datastructure

AFAICS (a) is very similar to the argument I made regarding the
stack-based instrumentation earlier, as it wasn't clear to me how would
that work for scans with multiple streams. It didn't occur to me it's an
issue for the current solution with TAM callbacks too.

However, does passing the info through the scan descriptor address this?
I don't see how could it, if there's just a single field per scan.

I'd expect the AM to merge the stats internally.

Hmmm, I'm not sure that'll be good enough. What if the two streams are
fundamentally different? For example, in the future we may end up using
two streams in index scans. One stream for the table (which is what the
prefetching is doing), and one for the index itself. Does it really make
sense to merge stats for the streams?

Furthermore, which place would be responsible for collecting and merging
the stats? AFAICS it needs to happen at the end of the scan, but it
can't happen in endscan, because that frees the descriptor (so we can't
store the result there). Wouldn't we need a new callback anyway?

Also, what do you mean by "one callback for each different type of
scan"? Right now we need two callbacks - one for TableScanDesc and one
for IndexScanDesc.

Those two were what I was referring to. The only reason we don't need more is
that we already are combining different scan types (seq, bitmap, tid, sample)
via TableScanDesc, so we really already are quite strongly associating the
stats with the *ScanDesc structures.

Regardless of how we structure it, since we already combine stats for streams
in the parallel query stats, perhaps we should add a field to track for how
many streams the stats are?

We could. But does it make sense to combine stats from distinct streams?

Do we expect to get more?

I think we eventually might want to use read streams for things like getting
the source tuple to update/delete in a bulk update and for detoasting during
expression evaluation (which is surprisingly often IO bound). These would be
bound to an AM but currently we don't have a real descriptor (however, the
need for that has come up repeatedly, e.g. to be able to use multi_insert for
INSERT statements).

True. But then again - does it make sense to "merge" stats form all
these streams used for very different purposes? OTOH I'm not sure it
works with the TAM approach either, we don't want a separate callback
for each individual stream.

I also suspect that there are some executor nodes that could benefit from
doing streaming IO internally, e.g. Sort, Materialize, spilling Hash Agg, Hash
joins... But that'd not be abstracted behind an AM or such, so it's not
really orthogonal.

Agreed.

For things like bulk inserts it would be quite nice if we could use AIO for
reading in index pages with fewer synchronous waits by doing a loop over the
to-be-inserted entries and doing an index lookup that stops at the first miss
and start reading that in.

Yes. I had a very ugly PoC last year to prefetch leaf pages for bulk
inserts, and it was a huge benefit.

Before thinking about detoasting I had a bit of a hard time seeing a clear
advantage of using the stack based infra - thinking we always should have very
clear association between the executor node and the stream (in contrast to the
buffer stats which are collected without any possibility to make such an
association). However, if we wanted to e.g. track streams that are done as
part of expression evaluation - I am not at all sure whether we do - that'd
dissolve that clear association, and would be much easier to implement with
the stack based approach.

Right. In some cases the association to streams may not be quite clear.

Would it be possible for scans to "define" a list of streams? For
example, an index scan have one stream for the index, one stream for
table, one stream for TOAST. And we'd have an array of stats, with one
slot for each stream.

With the current approach, we'd collect stats from each of all those
streams into the correct "slot" in the instrumentation.

With the stack-based instrumentation we might "tell" each stream which
slot to use. So ReadStream would get a field "stats_index" and it'd be
passed to INSTR_IOUSAGE_INCR() to update the correct slot.

I haven't tried any of this, and AFAIK we don't even have a scan using
two read streams yet.

I don't see much difference between an optional TAM callback for each
scan type, and an optional field in each scan descriptor.

It's not entirely different, I agree.

The scan descriptor approach seems a tad nicer because it means that a) the
read stream can be terminated once it's not needed before we get to the stats
collection phase and b) it allows to combine the stats for multiple streams,
without the AM having to store the stats in the AM specific part of the
descriptor, in a field that is then just returned by the AM.

We can base this on rs_flags, but wouldn't it be required in all
table_beginscan variants anyway? I have initially planned to pass a flag
indicating we're in EXPLAIN (ANALYZE), but rs_flags is not that. It's
hard-coded in tableam.

I think Melanie has a patch that changes that :)

Until then you could just add a helper that ORs in the instrumentation flag.

OK.

+/*
+ * read_stream_update_stats_io
+ *		update read_stream stats about size of I/O requests
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	stream->stats.io_count++;
+	stream->stats.io_nblocks += nblocks;
+	stream->stats.io_in_progress += in_progress;
+}
What's the point of the caller passing in stream->ios_in_progress? Are you
expecting some callers to pass in different values? If so, why?
Are you suggesting we remove the argument and just use
stream->ios_in_progress directly?
Yes, that's what I am suggesting.

OK, will do.

regards

--
Tomas Vondra

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

19 days ago

In reply to: Tomas Vondra (#12)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi!

Here's a v3 of the patch, adopting the idea of passing the data through
a field in the scan descriptor. I've kept it broken into multiple
smaller pieces 0002-0006 to show the incremental changes, but in the end
this should be all merged.

0001 - a prerequisite patch

0002 - v2 of the patch

0003 - basic cleanup / helpers to reduce code duplication, various
smaller fixes and improvements, ...

0004 - changes the design to use a new field in TableScanDesc

0005 - minor cleanup of 0004

0006 - support for TidRangeScan (the last scan using ReadStream),
requires changes similar to SeqScan in v2

Overall, I think it looks reasonable. There are probably ways to
simplify it further (e.g. by introducing a new read_stream_begin_
helper, so that it's not necessary to update every caller of
read_stream_begin_relation by adding a NULL argument).

The 0003 also changes the EXPLAIN to enable IO by default, just like we
do for BUFFERS. It seems like a reasonable precedent to me.

0004 reworks the tracking to use a field in TableScanDesc, instead of
adding a table AM callback. I agree it seems simpler / less disruptive.

0005 is a minor cleanup for 0004

0006 adds EXPLAIN support for TidRangeScan, similarly to how 0002 adds
support for SeqScan (but that still uses the TAM interface)

regards

--
Tomas Vondra

#14

Lukas Fittl

lukas@fittl.com

19 days ago

In reply to: Tomas Vondra (#13)

Re: EXPLAIN: showing ReadStream / prefetch stats

On Wed, Mar 18, 2026 at 3:41 PM Tomas Vondra <tomas@vondra.me> wrote:

The 0003 also changes the EXPLAIN to enable IO by default, just like we
do for BUFFERS. It seems like a reasonable precedent to me.

One side effect of that is that the tests now fail for me locally,
because the specific values are system-dependent. Attached a patch
(nocfbot-0002) that fixed that for me.

There is one detail maybe calling out specifically on JSON output:
Currently Postgres always emits all fields in JSON output, even if
they are zero. The code that you have in v3 skips the "I/O" group when
the value is zero, which doesn't work well with how current regression
tests are written. I'm definitely not a fan of the unnecessary
verbosity of JSON EXPLAIN output, but I'd suggest we don't break with
the tradition here, and instead always output the "I/O" group in
non-text formats. Also attached a patch for that (nocfbot-0001).

Overall I think the abstraction here seems reasonable if we're
primarily focused on getting the per-node instrumentation taken care
of.

That said, two thoughts on an example EXPLAIN output I just ran:

1) I do wonder if its a bit confusing that we propagate I/O timings up
the EXPLAIN tree, but not the "I/O" information - I realize fixing
that would be a bit involved though, e.g. we'd have to invent
accumulation logic in explain.c. It'd also maybe make people thing
this covers things like temporary file reads/etc.

2) The ordering of "I/O Timings" in relation to "I/O" feels off to me
(since they're not next to each other) - maybe we should re-order I/O
Timings to come before Buffers in show_buffer_usage to address that?

EXPLAIN (ANALYZE) SELECT COUNT(*) FROM t;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=218331.00..218331.01 rows=1 width=8) (actual
time=563.437..563.437 rows=1.00 loops=1)
Buffers: shared hit=15806 read=41274
I/O Timings: shared read=1.180
-> Seq Scan on t (cost=0.00..186080.80 rows=12900080 width=0)
(actual time=0.335..306.737 rows=12900005.00 loops=1)
Prefetch: avg=61.517 max=91 capacity=94
I/O: stalls=7 size=14.825 inprogress=5.321
Buffers: shared hit=15806 read=41274
I/O Timings: shared read=1.180
Planning Time: 0.101 ms
Execution Time: 563.471 ms
(10 rows)

Thanks,
Lukas

--
Lukas Fittl

#15

Tomas Vondra

tomas.vondra@2ndquadrant.com

18 days ago

In reply to: Lukas Fittl (#14)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi Lukas,

Thanks for the review. I've been working on cleaning up the patch,
reordering the changes a bit. I ran into most of the issues you
mentioned, and the attached v4 should address most of them.

I got rid of the "old" approach entirely, and just do everything through
the scan descriptor. I reorganized the patch series to make more sense -
it's split into three pieces (0001 is a prerequisite patch, not really a
part of the changes):

* 0002 - Basic changes, adding the info to a single scan node that
already uses shared instrumentation (=BitmapHeapScan). It adds the
counting to read_stream, and printing to explain. It also adjusts all
the failing tests checking query plans (usually by disabling I/O).

* 0003 - Adds support for SeqScan (has to setup the instrumentation for
parallel workers, ...).

* 0004 - Adds support for TidRangeScan (has to setup the instrumentation
for parallel workers, ...).

I think it'd make sense to commit these parts separately.

The 0002 part could be made smaller by introducing a new read_stream
helper, so that we don't need to add a parameter to all the places
calling read_stream_begin_relation(). But that's a detail.

More comments inline.

On 3/19/26 19:11, Lukas Fittl wrote:

On Wed, Mar 18, 2026 at 3:41 PM Tomas Vondra <tomas@vondra.me> wrote:

The 0003 also changes the EXPLAIN to enable IO by default, just like we
do for BUFFERS. It seems like a reasonable precedent to me.

One side effect of that is that the tests now fail for me locally,
because the specific values are system-dependent. Attached a patch
(nocfbot-0002) that fixed that for me.

Right. I haven't addressed that in v3, but with v4 the tests should be
passing (after each patch).

There is one detail maybe calling out specifically on JSON output:
Currently Postgres always emits all fields in JSON output, even if
they are zero. The code that you have in v3 skips the "I/O" group when
the value is zero, which doesn't work well with how current regression
tests are written. I'm definitely not a fan of the unnecessary
verbosity of JSON EXPLAIN output, but I'd suggest we don't break with
the tradition here, and instead always output the "I/O" group in
non-text formats. Also attached a patch for that (nocfbot-0001).

Yeah, I noticed that too, but I wasn't sure what to do about it. The I/O
stats are conditional on (io_count > 0) because it calculates average
I/O size etc. If there are no I/Os, we could probably print 0.0.

Another thing I noticed in the non-text formats is that this is the only
case with explicit groups, which adds nesting. It looks weird, so I plan
to remove that, and will add some "prefix" to the items.

Overall I think the abstraction here seems reasonable if we're
primarily focused on getting the per-node instrumentation taken care
of.

OK, good. I think passing the stats through a scan descriptor works OK.
I still have a strange feeling about it - we have the TAM interface and
yet we choose to pass data in other ways. It's better than global
variables, of course. But it's not the only piece of data returned
through the scan descriptor, so at least there's a precedent.

That said, two thoughts on an example EXPLAIN output I just ran:

1) I do wonder if its a bit confusing that we propagate I/O timings up
the EXPLAIN tree, but not the "I/O" information - I realize fixing
that would be a bit involved though, e.g. we'd have to invent
accumulation logic in explain.c. It'd also maybe make people thing
this covers things like temporary file reads/etc.

I'm not sure it makes sense to combine this information from multiple
nodes. If you have multiple read streams with very different average I/O
size and/or prefetch distances, what does the "global average" mean?

I also don't think it's very useful - the information about distance
and/or I/O size is most useful when analyzing individual nodes. What
would it tell me for multiple nodes combined?

So I don't think it's something we need, and we can add it later.

2) The ordering of "I/O Timings" in relation to "I/O" feels off to me
(since they're not next to each other) - maybe we should re-order I/O
Timings to come before Buffers in show_buffer_usage to address that?

Yeah, maybe. I don't think we have any explicit order, but why not.

Thanks!

--
Tomas Vondra

#16

Tomas Vondra

tomas.vondra@2ndquadrant.com

12 days ago

In reply to: Tomas Vondra (#15)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi,

Here's a slightly improved v5, with a couple minor changes over v4.

Patches 0001 and 0002 are prerequisites from other threads. 0001 was
already included in v4. 0002 is new and comes from [1], and it adds
"flags" to TAM beginscan() methods. This allows 0003 to pass a new
SO_SCAN_INSTRUMENT flag, so that e.g. heap_beginscan() can initialize
the instrumentation only when executed in EXPLAIN ANALYZE. In v4 it had
to be done every time, which did not seem quite great.

This mimics what we do for index scans in the index prefetching, which
only allocates the instrumentation when needed.

0004 and 0005 are adjusted to do this for SeqScan and TidRangeScans too.

0006 is a minor adjustment of the explain output, based on the earlier
discussion about non-text formats. It removes the explain groups (to not
add nested output), and instead makes the labels more descriptive. It
also prints the items every time, even with io_count==0.

The last change is that I added the number of I/Os, instead of printing
just the number of stalls and average I/O size. Without the count it
seemed a bit difficult to interpret. It might be possible to deduce some
of this from buffers reads, but I found it a bit unclear.

0006 should be split and merged into the earlier parts, I mostly kept it
to make those changes obvious.

regards

/messages/by-id/itvgqc6vncbjsjfmrptfvkkeg5vqzhalaguya2z77t6c6ctpc3@wsdrgbn4bxaa

--
Tomas Vondra

#17

Andres Freund

andres@anarazel.de

8 days ago

In reply to: Tomas Vondra (#16)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi,

Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
wait

In a subsequent commit read_stream.c will use this as an input to the read
ahead distance.

Note that this actually depends on a prior commit to work for io_uring...

From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 19 Mar 2026 22:25:09 +0100
Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)

This adds details about AIO / prefetch for executor nodes using a
ReadStream. Right now this applies only to BitmapHeapScan, because
that's the only scan node using a ReadStream and collecting
instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?

The stats are collected by the ReadStream unconditionally, i.e. without
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

I think this is not true anymore.

--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
TIMING [ <replaceable class="parameter">boolean</replaceable> ]
SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
+    IO [ <replaceable class="parameter">boolean</replaceable> ]
FORMAT { TEXT | XML | JSON | YAML }
</synopsis>
</refsynopsisdiv>
@@ -298,6 +299,17 @@ ROLLBACK;
</listitem>
</varlistentry>

+   <varlistentry>
+    <term><literal>IO</literal></term>
+    <listitem>
+     <para>
+      Include information on I/O performed by each node.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to <literal>TRUE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
<varlistentry>
<term><literal>FORMAT</literal></term>
<listitem>

I wonder if it's sane to default this to true. The amount of adjustments in
the regression tests required in the later commits makes me a bit wary. But I
guess just going around and adding a few dozen OPTION OFF is what we've lately
been doing, so perhaps it's ok.

@@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_nkeys = nkeys;
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
+ scan->rs_base.rs_instrument = NULL;
scan->rs_strategy = NULL; /* set in initscan */
scan->rs_cbuf = InvalidBuffer;

@@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,

scan->rs_read_stream = NULL;
+	/* allocate instrumentation */
+	if (flags & SO_SCAN_INSTRUMENT)
+		scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
+	else
+		scan->rs_base.rs_instrument = NULL;
+
/*
* Set up a read stream for sequential scans and TID range scans. This
* should be done after initscan() because initscan() allocates the
@@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
MAIN_FORKNUM,
cb,
scan,
-														  0);
+														  0,
+														  &scan->rs_base.rs_instrument->io);
}
else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
{
@@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
MAIN_FORKNUM,
bitmapheap_stream_read_next,
scan,
-														  sizeof(TBMIterateResult));
+														  sizeof(TBMIterateResult),
+														  &scan->rs_base.rs_instrument->io);
}
scan->rs_vmbuffer = InvalidBuffer;

It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
if that works because you then just compute the address of that. I think
you'll end up passing the offset of io in TableScanInstrumentation this way.

+static void
+print_io_usage(ExplainState *es, IOStats *stats)
+{
+	/* don't print stats if there's nothing to report */
+	if (stats->prefetch_count > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats->distance_sum * 1.0 / stats->prefetch_count),
+							 stats->distance_max,
+							 stats->distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats->io_count > 0)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats->stall_count);
+
+				appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+								 (stats->io_nblocks * 1.0 / stats->io_count),
+								 (stats->io_in_progress * 1.0 / stats->io_count));
+
+				appendStringInfoChar(es->str, '\n');
+			}

Why have these separate appendStringInfoChar's if you just could just include
the \n in the prior appendStringInfo?

Separately, is %.3f really the right thing? When is that degree of precision
useful here?

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 7e2c1b7467b..d181fd7b8ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
TBMIterator tbmiterator = {0};
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
+	EState *estate = node->ss.ps.state;

if (!pstate)
{
@@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
{
node->ss.ss_currentScanDesc =
table_beginscan_bm(node->ss.ss_currentRelation,
-							   0,
+							   (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
node->ss.ps.state->es_snapshot,
0,
NULL);
@@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
si->exact_pages += node->stats.exact_pages;
si->lossy_pages += node->stats.lossy_pages;
+
+		/* collect prefetch info for this process from the read_stream */
+		ACCUMULATE_IO_STATS(&si->stats.io, &node->ss.ss_currentScanDesc->rs_instrument->io);
}

I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
inline?

+/*
+ * Update stream stats with current pinned buffer depth.
+ *
+ * Called once per buffer returned to the consumer in read_stream_next_buffer().
+ * Records the number of pinned buffers at that moment, so we can compute the
+ * average look-ahead depth.
+ */
+static inline void
+read_stream_update_stats_prefetch(ReadStream *stream)
+{
+	IOStats *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->prefetch_count++;
+	stats->distance_sum += stream->pinned_buffers;
+	if (stream->pinned_buffers > stats->distance_max)
+		stats->distance_max = stream->pinned_buffers;
+}
+
+/*
+ * Update stream stats about size of I/O requests.
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	IOStats *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->io_count++;
+	stats->io_nblocks += nblocks;
+	stats->io_in_progress += in_progress;
+}
+
+static inline void
+read_stream_update_stats_stall(ReadStream *stream)
+{

Wonder if we should rename "stall" to "wait", seems like it might be easier to
understand for a more general audience.

/* ----------------------------------------------------------------
@@ -404,9 +471,50 @@ void
ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
+ EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ char *ptr;
+ Size size;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation,
+								 (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
+								 pscan);
+
+	/*
+	 * Workers don't get the pscan_len value in scan descriptor, so use the
+	 * TAM callback again. The result has to match the earlier result in
+	 * ExecSeqScanEstimate.
+	 */
+	size = table_parallelscan_estimate(node->ss.ss_currentRelation,
+									   estate->es_snapshot);
+

That seems quite grotty...

+	ptr = (char *) pscan;
+	ptr += MAXALIGN(size);
+
+	if (node->ss.ps.instrument)
+		node->sinstrument = (SharedSeqScanInstrumentation *) ptr;

Seems like the appropriate instrumentation pointers should be computed by the
leader, not in the workers.

diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 97cb7206676..3fa6d838545 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io off, format text) select *
Execution Time: N.N ms
(3 rows)
-select explain_filter('explain (analyze, buffers, io off, format xml) select * from int8_tbl i8');
- explain_filter
---------------------------------------------------------
- <explain xmlns="http://www.postgresql.org/N/explain"> +

...

+ <explain xmlns="http://www.postgresql.org/N/explain"> +

...

Seems these tests should really use unaligned output, to avoid this stupid
"every line changes despite not really anything changing" type of diff.

Greetings,

Andres Freund

#18

Tomas Vondra

tomas.vondra@2ndquadrant.com

8 days ago

In reply to: Andres Freund (#17)

Re: EXPLAIN: showing ReadStream / prefetch stats

On 3/30/26 19:28, Andres Freund wrote:

Hi,

Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
wait

In a subsequent commit read_stream.c will use this as an input to the read
ahead distance.

Note that this actually depends on a prior commit to work for io_uring...

Good point.

From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 19 Mar 2026 22:25:09 +0100
Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)

This adds details about AIO / prefetch for executor nodes using a
ReadStream. Right now this applies only to BitmapHeapScan, because
that's the only scan node using a ReadStream and collecting
instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?

Are you suggesting the commit adds support for all those scans (BHS,
SeqScan and TidRangeScan) or none of them? To me it seems better to have
at least some scan because of testing. But SeqScan/TidRangeScan don't
have the instrumentation infrastructure for parallel queries, and I
don't want to do that in the main patch - it seems rather unrelated. And
I also don't want to add it before the main patch.

The stats are collected by the ReadStream unconditionally, i.e. without
EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
integer counters), it's not worth gating this by a flag.

I think this is not true anymore.

Yes, that's a stale comment. Will fix.

--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
TIMING [ <replaceable class="parameter">boolean</replaceable> ]
SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
+    IO [ <replaceable class="parameter">boolean</replaceable> ]
FORMAT { TEXT | XML | JSON | YAML }
</synopsis>
</refsynopsisdiv>
@@ -298,6 +299,17 @@ ROLLBACK;
</listitem>
</varlistentry>

+   <varlistentry>
+    <term><literal>IO</literal></term>
+    <listitem>
+     <para>
+      Include information on I/O performed by each node.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to <literal>TRUE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
<varlistentry>
<term><literal>FORMAT</literal></term>
<listitem>

I'm not sure either. I initially added it as default OFF but it's true
it's similar to "buffers".

@@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_nkeys = nkeys;
scan->rs_base.rs_flags = flags;
scan->rs_base.rs_parallel = parallel_scan;
+ scan->rs_base.rs_instrument = NULL;
scan->rs_strategy = NULL; /* set in initscan */
scan->rs_cbuf = InvalidBuffer;

@@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,

scan->rs_read_stream = NULL;
+	/* allocate instrumentation */
+	if (flags & SO_SCAN_INSTRUMENT)
+		scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
+	else
+		scan->rs_base.rs_instrument = NULL;
+
/*
* Set up a read stream for sequential scans and TID range scans. This
* should be done after initscan() because initscan() allocates the
@@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
MAIN_FORKNUM,
cb,
scan,
-														  0);
+														  0,
+														  &scan->rs_base.rs_instrument->io);
}
else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
{
@@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
MAIN_FORKNUM,
bitmapheap_stream_read_next,
scan,
-														  sizeof(TBMIterateResult));
+														  sizeof(TBMIterateResult),
+														  &scan->rs_base.rs_instrument->io);
}
scan->rs_vmbuffer = InvalidBuffer;
It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
if that works because you then just compute the address of that. I think
you'll end up passing the offset of io in TableScanInstrumentation this way.

Yes, I have already noticed / fixed this in my development branch.

+static void
+print_io_usage(ExplainState *es, IOStats *stats)
+{
+	/* don't print stats if there's nothing to report */
+	if (stats->prefetch_count > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
+							 (stats->distance_sum * 1.0 / stats->prefetch_count),
+							 stats->distance_max,
+							 stats->distance_capacity);
+			appendStringInfoChar(es->str, '\n');
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats->io_count > 0)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
+								 stats->stall_count);
+
+				appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
+								 (stats->io_nblocks * 1.0 / stats->io_count),
+								 (stats->io_in_progress * 1.0 / stats->io_count));
+
+				appendStringInfoChar(es->str, '\n');
+			}

Why have these separate appendStringInfoChar's if you just could just include
the \n in the prior appendStringInfo?

Not sure, IIRC it went through multiple versions and at some point the
lines were far too long (esp. with histograms etc.).

Separately, is %.3f really the right thing? When is that degree of precision
useful here?

No idea. I think %.1f would be enough. We use %.2f in a couple places,
but that's not a huge difference.

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 7e2c1b7467b..d181fd7b8ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
TBMIterator tbmiterator = {0};
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
+	EState *estate = node->ss.ps.state;

if (!pstate)
{
@@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
{
node->ss.ss_currentScanDesc =
table_beginscan_bm(node->ss.ss_currentRelation,
-							   0,
+							   (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
node->ss.ps.state->es_snapshot,
0,
NULL);
@@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
si->exact_pages += node->stats.exact_pages;
si->lossy_pages += node->stats.lossy_pages;
+
+		/* collect prefetch info for this process from the read_stream */
+		ACCUMULATE_IO_STATS(&si->stats.io, &node->ss.ss_currentScanDesc->rs_instrument->io);
}

I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
inline?

Both a macro and static inline would work for me.

+/*
+ * Update stream stats with current pinned buffer depth.
+ *
+ * Called once per buffer returned to the consumer in read_stream_next_buffer().
+ * Records the number of pinned buffers at that moment, so we can compute the
+ * average look-ahead depth.
+ */
+static inline void
+read_stream_update_stats_prefetch(ReadStream *stream)
+{
+	IOStats *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->prefetch_count++;
+	stats->distance_sum += stream->pinned_buffers;
+	if (stream->pinned_buffers > stats->distance_max)
+		stats->distance_max = stream->pinned_buffers;
+}
+
+/*
+ * Update stream stats about size of I/O requests.
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	IOStats *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->io_count++;
+	stats->io_nblocks += nblocks;
+	stats->io_in_progress += in_progress;
+}
+
+static inline void
+read_stream_update_stats_stall(ReadStream *stream)
+{

Wonder if we should rename "stall" to "wait", seems like it might be easier to
understand for a more general audience.

I agree "wait" sounds clearer.

/* ----------------------------------------------------------------
@@ -404,9 +471,50 @@ void
ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
+ EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ char *ptr;
+ Size size;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation,
+								 (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
+								 pscan);
+
+	/*
+	 * Workers don't get the pscan_len value in scan descriptor, so use the
+	 * TAM callback again. The result has to match the earlier result in
+	 * ExecSeqScanEstimate.
+	 */
+	size = table_parallelscan_estimate(node->ss.ss_currentRelation,
+									   estate->es_snapshot);
+
That seems quite grotty...

Yes. But what's a better solution?

+	ptr = (char *) pscan;
+	ptr += MAXALIGN(size);
+
+	if (node->ss.ps.instrument)
+		node->sinstrument = (SharedSeqScanInstrumentation *) ptr;
Seems like the appropriate instrumentation pointers should be computed by the
leader, not in the workers.

This is how BitmapHeapScan does it. And how would the leader do that,
when the node is private in the worker?

diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 97cb7206676..3fa6d838545 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io off, format text) select *
Execution Time: N.N ms
(3 rows)
-select explain_filter('explain (analyze, buffers, io off, format xml) select * from int8_tbl i8');
- explain_filter
---------------------------------------------------------
- <explain xmlns="http://www.postgresql.org/N/explain"> +
...

+ <explain xmlns="http://www.postgresql.org/N/explain"> +

...

Seems these tests should really use unaligned output, to avoid this stupid
"every line changes despite not really anything changing" type of diff.

Probably. I can adjust that in an initial patch.

regards

--
Tomas Vondra

#19

Tomas Vondra

tomas.vondra@2ndquadrant.com

7 days ago

In reply to: Tomas Vondra (#18)

Re: EXPLAIN: showing ReadStream / prefetch stats

Here's v6, addressing most of the review comments. Most of this is
pretty simple / non-contentious. The remaining questions are in the last
two items (phs_len field, initializing the instrumentation pointers):

- 0001 tweaks WaitReadBuffers to return if a wait was needed. As pointed
out, this does not include some changes needed for io_uring, I don't
want to make this larger. So test with "worker".

- 0002 switches a couple explain tests to unaligned output. This makes
it easier to understand output changes with yaml/json/xml output.

- I kept the three scans in separate changes. It makes it easier to
review etc. and to me it seems easier to review. I'd keep it like this
for commit, but if someone thinks we should squash it into a single
commit that's possible too.

- I kept the IO option enabled by default. I think it makes sense to
keep this aligned with BUFFERS, but on the other hand the explain should
not be overwhelming. Various other explain options are similarly
helpful, and yet we left them OFF by default.

- Cleanup of the explain formatting code (merging appendString lines),
and so on. One question is whether to keep printing the information only
when (stats->prefetch_count > 0), especially in the non-text cases.

With the three scans this is not an issue, because the read stream is
initialized right away. But with index scans we delay that a bit, so I
wonder if this might cause some test instability. Probably not.

- I switched to %.2f format for printing the averages etc. instead of
%.3f. Not a huge change, we could probably go even to %.1f ...

- Replaced the ACCUMULATE_IO_STATS macro with a static inline.

- Renamed "stalls" to "waits".

- Fixed a couple obsolete comments, and the bug in dereferencing NULL
when the instrumentation was not initialized.

- Added the pscan_len length (returned by table_parallelscan_estimate)
to ParallelTableScanDesc, so that the workers can use this to calculate
the pointer of shared instrumentation without calling the _estimate
again (which I agree seems wrong and grotty).

This only matters for SeqScan and TidRangeScan, and it makes it quite a
bit cleaner. I don't see a better way. Right now the nodes set the new
field right after calling shm_toc_allocate, but I suppose it'd be more
correct to pass it to table_parallelscan_initialize() and set it from
there. Barring objections I'll adjust this for v7.

- I still don't understand how could the leader calculate the
instrumentation pointers for workers, so it's done the same way as in
BHS (except that we need to use the new phs_len field to calculate the
initial offset, of course).

regards

--
Tomas Vondra

#20

Andres Freund

andres@anarazel.de

7 days ago

In reply to: Tomas Vondra (#18)

Re: EXPLAIN: showing ReadStream / prefetch stats

Hi,

On 2026-03-30 20:21:29 +0200, Tomas Vondra wrote:

From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 19 Mar 2026 22:25:09 +0100
Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)

This adds details about AIO / prefetch for executor nodes using a
ReadStream. Right now this applies only to BitmapHeapScan, because
that's the only scan node using a ReadStream and collecting
instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?

Are you suggesting the commit adds support for all those scans (BHS,
SeqScan and TidRangeScan) or none of them?

I guess I mostly just didn't quite understand what differentiates bitmap scans
from the other scans, based on this explanation.

To me it seems better to have at least some scan because of testing. But
SeqScan/TidRangeScan don't have the instrumentation infrastructure for
parallel queries, and I don't want to do that in the main patch - it seems
rather unrelated. And I also don't want to add it before the main patch.

I'd probably do the latter, i.e. add it before the main patch. Or at least
separately from the change to show read stream instrumentation.

Separately, is %.3f really the right thing? When is that degree of precision
useful here?

No idea. I think %.1f would be enough. We use %.2f in a couple places,
but that's not a huge difference.

I'd probably use .1f here, I don't think you'd ever need more
precision. Showing that it's not a round number is useful, but more than
that... ?

/* ----------------------------------------------------------------
@@ -404,9 +471,50 @@ void
ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
+ EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ char *ptr;
+ Size size;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation,
+								 (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
+								 pscan);
+
+	/*
+	 * Workers don't get the pscan_len value in scan descriptor, so use the
+	 * TAM callback again. The result has to match the earlier result in
+	 * ExecSeqScanEstimate.
+	 */
+	size = table_parallelscan_estimate(node->ss.ss_currentRelation,
+									   estate->es_snapshot);
+
That seems quite grotty...
Yes. But what's a better solution?