Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
Hello,
First, let me say this is my first bug report and attempt at patching
PostgreSQL, fresh after the hacking webinar. Please let me know if I need
to clarify anything.
TL;DR: EXPLAIN (ANALYZE, BUFFERS) reports different buffer counts for the Sort
node between the first and subsequent executions in a session.
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I found this while working on a SQL regression tool (regresql) that uses the
total buffers reported by EXPLAIN (ANALYZE, BUFFERS) to detect query
performance deviations.
The issue surfaced on smaller queries, where test cases showed a consistent
"improvement" in performance (fewer buffers) when comparing baseline and
actual test runs. I had not noticed it on larger data sets because the
difference in buffers was negligible relative to the larger operations.
The example below is the test case I managed to isolate.
--- how to reproduce
CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
ANALYZE sort_buffer_test;
-- run twice in new psql session and observe reported buffer change
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
---
This behaviour is session specific; tested on 17.x, 18.x and the current
development version (just FYI, all verified on both amd64 and aarch64).
For ORDER BY, I traced the potential issue down to pg_amop/pg_amproc
lookups. The specific lookups vary by operation, but the pattern is the
same: the first execution incurs catalog reads that get cached for
subsequent runs. This applies to DISTINCT (sorted, not hashed), GROUP BY,
window functions, etc., which seems to isolate it to the Sort node.
All of these, I believe, are really metadata resolution (i.e. the planner
already knows what to sort by).
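The pattern described above can be illustrated with a toy model (plain Python, not PostgreSQL code; the catalog names are real but the page counts and cache shape are made up for illustration): executor-side metadata resolution hits the catalog on the first run of a session and a session-local cache afterwards, so the reported buffer count drops on the second execution.

```python
# Toy model: the first execution in a session resolves sort metadata via
# catalog reads (counted as buffers), while later executions are served
# from the syscache. Page counts here are illustrative only.

CATALOG_PAGES = {"pg_amop": 2, "pg_amproc": 1}  # pretend pages per lookup

syscache = {}          # session-local cache, empty in a fresh session
buffers_accessed = 0   # stand-in for the EXPLAIN (BUFFERS) counter

def lookup(catalog, key):
    """Return cached metadata, reading catalog pages only on a cache miss."""
    global buffers_accessed
    if (catalog, key) not in syscache:
        buffers_accessed += CATALOG_PAGES[catalog]  # catalog read
        syscache[(catalog, key)] = f"{catalog}:{key}"
    return syscache[(catalog, key)]

def run_sort_query():
    """One execution: resolve sort support metadata, then scan one page."""
    global buffers_accessed
    buffers_accessed = 0
    lookup("pg_amop", "int4_ops")    # operator lookup
    lookup("pg_amproc", "int4_ops")  # support function lookup
    buffers_accessed += 1            # the actual seq scan of the table
    return buffers_accessed

first = run_sort_query()   # fresh session: catalog reads included -> 4
second = run_sort_query()  # warm syscache: only the table scan -> 1

print(first, second)
```

Clearing the toy `syscache` between runs would restore the higher count, which is why the behaviour is per-session.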
Having said that - I'm happy to try to fix it. Just wondering what is the
right direction to go:
1. Pre-warm the syscache during planning (tried locally and it works, but
it does not feel conceptually right and is rather 'hacky')
2. Pre-compute and store the metadata, modifying each plan node struct for
each type (lots of work). I believe that's what MergeJoin already does.
3. Or this is 'works as expected', which I don't think is the case, and
which would force me to do pre-warming on the regresql side
Kind regards,
Radim
On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I traced the potential issue down in case of ORDER BY to pg_amop/pg_amproc lookups. The specific lookups vary by operation but the pattern is the same: first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY, window functions, etc - which seems to isolate it to sort node.
I'm finding it hard to fathom why you think this is a bug. We have
various caches that will require accessing various catalogue tables
and probably indexes too, which will be accessed unless the cache has
already been populated. The buffer accesses for these are included in the
buffers count in EXPLAIN.
What is it you expect to happen here? If we access buffers and don't
report them, then IMO, that's a bug.
David
On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I traced the potential issue down in case of ORDER BY to
pg_amop/pg_amproc lookups. The specific lookups vary by operation but the
pattern is the same: first execution incurs catalog reads that get cached
for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY,
window functions, etc - which seems to isolate it to sort node.
I'm finding it hard to fathom why you think this is a bug. We have
various caches that will require accessing various catalogue tables
and probably indexes too, which will be accessed unless the cache has
already been populated. The buffer accesses for these are included in the
buffers count in EXPLAIN.
What is it you expect to happen here? If we access buffers and don't
report them, then IMO, that's a bug.
David
Apologies, "bug" might be a bit harsh, and I didn't mean to imply those
buffers won't be reported. My assumption was that catalogue lookups for
resolving sort operator metadata would be reported as planning buffers
rather than execution buffers. That is already how other lookups for the
very same query above are reported.
The executor re-resolves sort operator metadata that the planner already
looked up. The first lookup counts as planning buffers, the second as
execution buffers.
Would passing the already resolved opfamily from the planner be a
reasonable direction? My reasoning is that I'm building on the hypothesis
that buffers (whether shared hit or read) should be deterministic for the
same query on the same data with the same plan. The re-resolution on first
execution breaks that.
Radim
On 2/2/26 23:54, Radim Marek wrote:
Hello,
First let me say this is a first time bug report and trying to patch
PostgreSQL, fresh after the hacking webinar. Please let me know if I
need to clarify something.
TL;DR: EXPLAIN (ANALYZE, BUFFERS) reports different buffer counts for the
Sort node between first and subsequent executions per session.
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I found it while working on a SQL regression tool (regresql) that uses
total reported buffers from EXPLAIN (ANALYZE, BUFFERS) to detect deviations
in query performance.
The issue came up on smaller queries where test cases showed consistent
"improvement" in performance (fewer buffers) when comparing baseline and
actual test runs. Previously I did not notice it on larger data sets due
to the difference in buffers being negligible given the larger operations.
The example is the test case I managed to isolate.
--- how to reproduce
CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
ANALYZE sort_buffer_test;
-- run twice in new psql session and observe reported buffer change
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
---
This behaviour is session specific, tested on 17.x, 18.x and the current
development version (just FYI, all verified on both amd64 and aarch64).
I traced the potential issue down in case of ORDER BY to pg_amop/
pg_amproc lookups. The specific lookups vary by operation but the
pattern is the same: first execution incurs catalog reads that get
cached for subsequent runs. This applies to DISTINCT (sort, not hashed),
GROUP BY, window functions, etc - which seems to isolate it to sort node.
All those I believe are actually metadata resolution (i.e. planner
already knows what to sort by).
Correct. AFAICS this happens because the execution may need additional
metadata that was not needed for planning. In this particular case the
planning checks that the operators are valid/compatible, but does not
look up all the associated pg_amproc entries. That only happens during
execution, when actually constructing the "tuplesort" object.
But there are probably many similar cases.
Having said that - I'm happy to try to fix it. Just wondering what is
the right direction to go:
1. Pre-warm syscache during planning (tried locally and it works - but
does not feel conceptually right and it's rather 'hacky')
Yeah, not going to happen. We'd have to preload the whole syscache,
because during planning we don't even know if that particular path/plan
will win. So we only do the bare minimum, to minimize the cost of
planning. And then we initialize the additional fields later, once we
have the cheapest plan.
2. Pre-compute and store metadata and modify each plan node struct for
each type (lots of work). I believe that's what MergeJoin already does.
I suppose MergeJoin may simply need more metadata during planning. If it
does more than that, it might count as a bug (and it should do less).
3. Or this is 'works as expected' - which I don't think it's the case
and would force me to do pre-warming on regresql side
I think it mostly works as expected/designed. That however does not mean
the issue you pointed out does not exist. IMHO what regresql is aiming
to do would be quite handy - e.g. in the index prefetching patch we have
regression tests checking that we're not accessing an unexpected number of
buffers (or more buffers than master).
I'm planning to check if regresql would be a more elegant way to do
that. For now we're relying on plain regression tests (SQL + expected
output), and that happens to not have this issue because it executes the
queries in a session with a "deterministic" state.
regards
--
Tomas Vondra
On 2/3/26 08:21, Radim Marek wrote:
On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I traced the potential issue down in case of ORDER BY to pg_amop/
pg_amproc lookups. The specific lookups vary by operation but the
pattern is the same: first execution incurs catalog reads that get
cached for subsequent runs. This applies to DISTINCT (sort, not
hashed), GROUP BY, window functions, etc - which seems to isolate it
to sort node.
I'm finding it hard to fathom why you think this is a bug. We have
various caches that will require accessing various catalogue tables
and probably indexes too, which will be accessed unless the cache has
already been populated. The buffer accesses for these are included in
the buffers count in EXPLAIN.
What is it you expect to happen here? If we access buffers and don't
report them, then IMO, that's a bug.
David
Apologies, "bug" might be a bit harsh and didn't want to imply those
buffers won't be reported. My assumption is that catalogue lookups for
resolving sort operator metadata would be reported as planning buffers
rather than execution buffers. This is already what's reported for other
lookups for very same query above.
I wouldn't call this a "bug", but the behavior may be a bit surprising,
and it works against what regresql aims to do.
The executor re-resolves sort operator metadata that the planner already
looked up. The first lookup counts as planning buffers, the second as
execution buffers.
I don't think it re-resolves the metadata. AFAIK it does fetch
additional information about the opclass, which was not needed for planning.
Would passing that already resolved opfamily from the planner be a
reasonable direction? My reasoning in this case is that I'm building on
the hypothesis that buffers (whether shared hit or read) should be
deterministic for the same query on the same data, and the same plan.
This re-resolution on first execution breaks it.
No, not really. That would increase the cost of planning - we do need to
do the lookup later, but we only do that for the one plan that "won". If
we did that during planning, it'd have to happen for all plans.
Perhaps we could identify buffer accesses from syscache lookups, and
track those separately? Then we'd know which of the hits/reads are from
"actual" execution.
regards
--
Tomas Vondra
On Tue, Feb 3, 2026 at 3:02 AM Tomas Vondra <tomas@vondra.me> wrote:
Perhaps we could identify buffer accesses from syscache lookups, and
track those separately? Then we'd know which of the hits/reads are from
"actual" execution.
Doing this in the current infrastructure (with the global
pgBufferUsage counters) is pretty expensive I think, since we'd need
to do an extra call to BufferUsageAccumDiff after syscache lookups, to
get the buffer usage just for that activity.
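The cost concern follows from the shape of the accounting: PostgreSQL keeps global counters (pgBufferUsage) and derives per-activity usage by snapshotting before an activity and diffing after it (the BufferUsageAccumDiff idea). Below is a simplified, standalone model of that pattern - plain Python, not the actual PostgreSQL structs or signatures - showing why attributing syscache lookups separately means an extra snapshot/diff bracketing every lookup.

```python
# Simplified model of snapshot/diff buffer accounting, loosely mirroring
# pgBufferUsage + BufferUsageAccumDiff. Illustrative only, not PostgreSQL code.
from dataclasses import dataclass

@dataclass
class BufferUsage:
    shared_hit: int = 0
    shared_read: int = 0

# Global counters, bumped by every buffer access in the backend.
pg_buffer_usage = BufferUsage()

def access_buffer(hit: bool) -> None:
    if hit:
        pg_buffer_usage.shared_hit += 1
    else:
        pg_buffer_usage.shared_read += 1

def accum_diff(dst: BufferUsage, add: BufferUsage, sub: BufferUsage) -> None:
    """dst += add - sub, per counter (the BufferUsageAccumDiff idea)."""
    dst.shared_hit += add.shared_hit - sub.shared_hit
    dst.shared_read += add.shared_read - sub.shared_read

def snapshot() -> BufferUsage:
    return BufferUsage(pg_buffer_usage.shared_hit, pg_buffer_usage.shared_read)

# Attributing just the "syscache" portion requires bracketing each lookup:
syscache_usage = BufferUsage()

start = snapshot()                   # extra work, per lookup
access_buffer(hit=False)             # pretend: catalog index page read
access_buffer(hit=True)              # pretend: catalog heap page hit
accum_diff(syscache_usage, pg_buffer_usage, start)  # extra work, per lookup

access_buffer(hit=True)              # ordinary executor buffer hit, not bracketed

print(syscache_usage)
```

The two bracketing calls around every lookup are the overhead Lukas refers to; the stack-based approach in [0] aims to avoid paying it per call site.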
If splitting out buffers by syscache vs others is of interest, I would
welcome review of the stack-based instrumentation patch over at [0],
which has an example of how it can be used to more easily break Index
Scans out into index and table access - the same logic could be
applied to syscache lookups I think.
[0]: /messages/by-id/CAP53Pkw85U-aMRzkZ+kRKfCh0pA5vyo=_W16gPK4sirZxiJy=A@mail.gmail.com
Thanks,
Lukas
--
Lukas Fittl
On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
Apologies, "bug" might be a bit harsh and didn't want to imply those buffers won't be reported. My assumption is that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. This is already what's reported for other lookups for very same query above.
The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.
If it was looking up the same thing, then there shouldn't be a cache
miss, and the buffers won't be accessed.
Would passing that already resolved opfamily from the planner be a reasonable direction? My reasoning in this case is that I'm building on the hypothesis that buffers (whether shared hit or read) should be deterministic for the same query on the same data, and the same plan. This re-resolution on first execution breaks it.
I think that offloading more work into the planner for looking
up things that are always the same for every invocation of a plan is
generally good. For anything that moves things more in that direction,
you'd need to ensure that the plan is correctly invalidated when
something changes about the extra thing you looked up in planning.
I've not looked into the specifics of this instance, other than the
function being called which causes the buffer usage is
get_opfamily_proc(). I don't see that called during planning for this
query, so your statement about "already resolved opfamily from the
planner" doesn't seem true.
However, I imagine this is just one of many things and if you're
aiming to stabilise your tests doing this, then you'll likely be
fixing things for a long time to come. I imagine a better way if
you're just doing this for running test suites would be to invent some
way to prepopulate the sys and cat caches. Maybe that could be a
contrib module. I don't think there's a way to get a relcache miss
during execution, but I guess it could change with prepared statements
if we ever decided to start evicting long unused relcache entries one
day.
David
On 2/3/26 22:53, David Rowley wrote:
On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
Apologies, "bug" might be a bit harsh and didn't want to imply those buffers won't be reported. My assumption is that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. This is already what's reported for other lookups for very same query above.
The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.
If it was looking up the same thing, then there shouldn't be a cache
miss, and the buffers won't be accessed.
Right.
I think there's a bit of confusion because the planning phase resolves
only some of the information, and then some additional pieces are left
to the execution phase, which won't be counted in "planning" buffers,
and that may seem surprising to users.
Would passing that already resolved opfamily from the planner be a reasonable direction? My reasoning in this case is that I'm building on the hypothesis that buffers (whether shared hit or read) should be deterministic for the same query on the same data, and the same plan. This re-resolution on first execution breaks it.
I think generally, offloading more work into the planner for looking
up things that are always the same for every invocation of a plan is
generally good. For anything that moves things more in that direction,
you'd need to ensure that the plan is correctly invalidated when
something changes about the extra thing you looked up in planning.
I've not looked into the specifics of this instance, other than the
function being called which causes the buffer usage is
get_opfamily_proc(). I don't see that called during planning for this
query, so your statement about "already resolved opfamily from the
planner" doesn't seem true.
Yes, that's essentially what I explained yesterday.
I'm not sure about moving all these lookups to the planner (from the
executor). Could this easily hurt some cases? We'd need to do this for
all fields the final plan might need, i.e. we'd need to initialize all
fields for all paths we construct - not just the metadata needed by the
final cheapest one. That could be a lot of additional lookups, right?
That would make the first execution more expensive. For later executions
of a query the lookups would be cached, but we'd still be paying the
(cheap) cost of the lookup.
Of course, there are a lot of such fields - this amproc is just an
example. I don't think it's practical to rework all of that to
initialize everything early.
However, I imagine this is just one of many things and if you're
aiming to stabilise your tests doing this, then you'll likely be
fixing things for a long time to come. I imagine a better way if
you're just doing this for running test suites would be to invent some
way to prepopulate the sys and cat caches. Maybe that could be a
contrib module. I don't think there's a way to get a relcache miss
during execution, but I guess it could change with prepared statements
if we ever decided to start evicting long unused relcache entries one
day.
I'm not sure there's a good way to warm up the sys/cat caches, short of
loading everything. Because how would you know what will be needed by
the execution plan? It seems very complex.
One option would be to run the queries twice - the first one would
warmup caches, the second execution would be the measured one. But that
makes the tool 2x as expensive.
But what if we had a way to *invalidate* all the caches? That'd also
make the behavior deterministic - there would be no hits initially. And
it seems much simpler to implement.
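The invalidation idea can be sketched with a toy cache (plain Python, illustrative only - the counts and names are invented, not PostgreSQL behavior): dropping all cached entries before each measured run makes every run behave like a first run, so the reported counts become deterministic.

```python
# Toy illustration: invalidating caches before each run makes the pretend
# buffer counts deterministic (every run pays the "cold" cost).
# Illustrative Python only, not PostgreSQL code.

cache = {}

def run_query() -> int:
    """Return a pretend buffer count: 3 extra pages on a cold cache."""
    buffers = 1  # the table scan itself
    if "sort_metadata" not in cache:
        buffers += 3  # catalog lookups on a cache miss
        cache["sort_metadata"] = object()
    return buffers

def invalidate_all_caches() -> None:
    cache.clear()

# Without invalidation: the first run differs from the rest.
warm = [run_query() for _ in range(3)]   # first run cold, then cached

# With invalidation before each run: every run is identical.
cold = []
for _ in range(3):
    invalidate_all_caches()
    cold.append(run_query())

print(warm, cold)
```

The counts in `cold` are all equal, which is the determinism a tool like regresql needs; the trade-off is that every measured run pays the cold-cache cost.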
regards
--
Tomas Vondra
On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
On 2/3/26 22:53, David Rowley wrote:
I think generally, offloading more work into the planner for looking
up things that are always the same for every invocation of a plan is
generally good. For anything that moves things more in that direction,
you'd need to ensure that the plan is correctly invalidated when
something changes about the extra thing you looked up in planning.
I've not looked into the specifics of this instance, other than the
function being called which causes the buffer usage is
get_opfamily_proc(). I don't see that called during planning for this
query, so your statement about "already resolved opfamily from the
planner" doesn't seem true.
Yes, that's essentially what I explained yesterday.
I'm not sure about moving all these lookups to the planner (from the
executor). Could this easily hurt some cases? We'd need to do this for
all fields the final plan might need, i.e. we'd need to initialize all
fields for all paths we construct - not just the metadata needed by the
final cheapest one. That could be a lot of additional lookups, right?
That would make the first execution more expensive. Later executions of
a query that would be cached, but we'd be still paying the cost for the
lookup (cheap).
We do look some things up at createplan time. e.g calling
get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys().
However, that could still be an additional lookup as the existing
lookup for the case in question is in ExecSort rather than in
ExecInitSort(), so there could be a surplus lookup if the sort node is
never executed. I doubt that's worth worrying about too much. It's
normally large memory allocations we want to defer until fetching the
first row from the node.
Of course, there's a lot of such fields - this amproc is just an
example. I don't think it's practical to rework all of that to
initialize everything early.
However, I imagine this is just one of many things and if you're
aiming to stabilise your tests doing this, then you'll likely be
fixing things for a long time to come. I imagine a better way if
you're just doing this for running test suites would be to invent some
way to prepopulate the sys and cat caches. Maybe that could be a
contrib module. I don't think there's a way to get a relcache miss
during execution, but I guess it could change with prepared statements
if we ever decided to start evicting long unused relcache entries one
day.
I'm not sure there's a good way to warm up the sys/cat caches, short of
loading everything. Because how would you know what will be needed by
the execution plan? It seems very complex.
I imagined if it's just for machines running tests then you could just
load everything. If it was coded in such a way that a tuple fetched by
doing a Seq Scan on the catalogue table was what went into the cache,
rather than having the Seq Scan drive the normal cache lookup code
(resulting in a subsequent Index Scan on the catalogue's index), then it
could be done with fairly low overhead. I imagine in the order of
<10ms from fresh initdb. That doesn't seem excessively long for
machines running tests in the background.
One option would be to run the queries twice - the first one would
warmup caches, the second execution would be the measured one. But that
makes the tool 2x as expensive.
Maybe. I don't know the tool or how people use it.
But what if we had a way to *invalidate* all the caches? That'd also
make the behavior deterministic - there would be no hits initially. And
it seems much simpler to implement.
There is debug_discard_caches, but that requires an assert-enabled build.
David
On 2/5/26 01:15, David Rowley wrote:
On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
On 2/3/26 22:53, David Rowley wrote:
I think generally, offloading more work into the planner for looking
up things that are always the same for every invocation of a plan is
generally good. For anything that moves things more in that direction,
you'd need to ensure that the plan is correctly invalidated when
something changes about the extra thing you looked up in planning.
I've not looked into the specifics of this instance, other than the
function being called which causes the buffer usage is
get_opfamily_proc(). I don't see that called during planning for this
query, so your statement about "already resolved opfamily from the
planner" doesn't seem true.
Yes, that's essentially what I explained yesterday.
I'm not sure about moving all these lookups to the planner (from the
executor). Could this easily hurt some cases? We'd need to do this for
all fields the final plan might need, i.e. we'd need to initialize all
fields for all paths we construct - not just the metadata needed by the
final cheapest one. That could be a lot of additional lookups, right?
That would make the first execution more expensive. Later executions of
a query that would be cached, but we'd be still paying the cost for the
lookup (cheap).
We do look some things up at createplan time, e.g. calling
get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys().
However, that could still be an additional lookup as the existing
lookup for the case in question is in ExecSort rather than in
ExecInitSort(), so there could be a surplus lookup if the sort node is
never executed. I doubt that's worth worrying about too much. It's
normally large memory allocations we want to defer until fetching the
first row from the node.
Perhaps, you may be right about this. I'm still skeptical about having
to figure out all the fields we'll need during execution. I doubt we'll
go around and adjust all those places, so the discussion about the cost
of additional lookups is purely theoretical.
Of course, there's a lot of such fields - this amproc is just an
example. I don't think it's practical to rework all of that to
initialize everything early.
However, I imagine this is just one of many things and if you're
aiming to stabilise your tests doing this, then you'll likely be
fixing things for a long time to come. I imagine a better way if
you're just doing this for running test suites would be to invent some
way to prepopulate the sys and cat caches. Maybe that could be a
contrib module. I don't think there's a way to get a relcache miss
during execution, but I guess it could change with prepared statements
if we ever decided to start evicting long unused relcache entries one
day.
I'm not sure there's a good way to warm up the sys/cat caches, short of
loading everything. Because how would you know what will be needed by
the execution plan? It seems very complex.
I imagined if it's just for machines running tests then you could just
load everything. If it was coded in such a way that a tuple fetched by
doing a Seq Scan on the catalogue table was what went into the cache,
rather than the Seq Scan drives the normal cache lookup code,
resulting in a subsequent Index Scan on the catalogue's index, then it
could be done with fairly low overhead. I imagine in the order of
<10ms from fresh initdb. That doesn't seem excessively long for
machines running tests in the background.
So we'd just go through all the caches relcaches/catcaches/... and load
all the stuff that's in pg_catalog? I guess that could work, although
I'm not sure how convenient it would be for objects created in the tests
themselves (I'm not sure if those end up in the cache right away).
One option would be to run the queries twice - the first one would
warmup caches, the second execution would be the measured one. But that
makes the tool 2x as expensive.
Maybe. I don't know the tool or how people use it.
But what if we had a way to *invalidate* all the caches? That'd also
make the behavior deterministic - there would be no hits initially. And
it seems much simpler to implement.
There is debug_discard_caches, but that requires an assert-enabled build.
Right. I suppose we could have a SQL function calling
InvalidateSystemCachesExtended to do that. In fact, an extension can
probably add such a function already.
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
On 2/5/26 01:15, David Rowley wrote:
I imagined if it's just for machines running tests then you could just
load everything. If it was coded in such a way that a tuple fetched by
doing a Seq Scan on the catalogue table was what went into the cache,
rather than the Seq Scan drives the normal cache lookup code,
resulting in a subsequent Index Scan on the catalogue's index, then it
could be done with fairly low overhead. I imagine in the order of
<10ms from fresh initdb. That doesn't seem excessively long for
machines running tests in the background.
So we'd just go through all the caches relcaches/catcaches/... and load
all the stuff that's in pg_catalog? I guess that could work,
... until there's a cache flush event. This whole discussion seems
to me to be based on a misconception.
regards, tom lane
Thank you all for your time and the detailed responses. To clarify -
RegreSQL works with tolerances, and I only came across this when detecting
improvements (i.e. the opposite of regressions), where smaller executions
showed consistent "gains" that turned out to be cache related.
Understanding the complexity of the underlying cause is, in a way, enough
for me - there's much more depth here than I initially anticipated. My
warm-up fix attempts had led me to believe it would be much easier.
I'll be honest - Lukas' patches alone will take me a few weeks to properly
digest. But in the end, that's exactly the kind of deeper understanding I
was hoping to build by engaging here.
I really do appreciate the patience with a first-time poster :)
Radim
On Thu, 5 Feb 2026 at 16:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Tomas Vondra <tomas@vondra.me> writes:
On 2/5/26 01:15, David Rowley wrote:
I imagined if it's just for machines running tests then you could just
load everything. If it was coded in such a way that a tuple fetched by
doing a Seq Scan on the catalogue table was what went into the cache,
rather than the Seq Scan drives the normal cache lookup code,
resulting in a subsequent Index Scan on the catalogue's index, then it
could be done with fairly low overhead. I imagine in the order of
<10ms from fresh initdb. That doesn't seem excessively long for
machines running tests in the background.
So we'd just go through all the caches relcaches/catcaches/... and load
all the stuff that's in pg_catalog? I guess that could work,
... until there's a cache flush event. This whole discussion seems
to me to be based on a misconception.
regards, tom lane