Stack-based tracking of per-node WAL/buffer usage

Started by Lukas Fittl6 months ago16 messages

lukas@fittl.com

6 months ago

Hi,

Please find attached a patch series that introduces a new paradigm for how
per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
distinct buffer hits [0]See lightning talk slides from PGConf.Dev discussing an HLL-based EXPLAIN (BUFFERS DISTINCT): https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf.

Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in
InstrStopNode we call the rather
expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity
happened within a given node cycle.

This proposal instead uses a stack, where each time we enter a node
(InstrStartNode) we point a new global (pgInstrStack) to the current stack
entry. Whilst we're in that node we increment buffer/WAL usage statistics
to the stack entry. On exit (InstrStopNode) we restore the previous entry.

This change provides about a 10% performance benefit for EXPLAIN ANALYZE on
paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):

CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);

master (124ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual
time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual
time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms

patched (109ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual
time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual
time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)

I have also prototyped a more ambitious approach that completely removes
pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g.
pg_stat_statements), but for now this patch set does not include that
change, but instead keeps adding to these legacy globals as well.

Patches attached:

0001: Separate node instrumentation from other use of Instrumentation struct

Previously different places (e.g. query "total time") were repurposing
the per-node Instrumentation struct. Instead, simplify the Instrumentation
struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
the NodeInstrumentation struct to carry forward the per-node
instrumentation information.

0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros

0003: Introduce stack for tracking per-node WAL/buffer usage

Feedback/thoughts welcome!

CCing Andres since he had expressed interest in this off-list.

[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based EXPLAIN (BUFFERS DISTINCT): https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf
EXPLAIN (BUFFERS DISTINCT):
https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf

Thanks,
Lukas

--
Lukas Fittl

Andres Freund

andres@anarazel.de

6 months ago

In reply to: Lukas Fittl (#1)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:

Please find attached a patch series that introduces a new paradigm for how
per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
distinct buffer hits [0].

I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.

0001: Separate node instrumentation from other use of Instrumentation struct

Previously different places (e.g. query "total time") were repurposing
the per-node Instrumentation struct. Instead, simplify the Instrumentation
struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
the NodeInstrumentation struct to carry forward the per-node
instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).

0003: Introduce stack for tracking per-node WAL/buffer usage

From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b83ced9a57a..1c2268bc608 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
DestReceiver *dest;
bool		sendTuples;
MemoryContext oldcontext;
+	InstrStackResource *res;

/* sanity checks */
Assert(queryDesc != NULL);
@@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (queryDesc->totaltime)
InstrStart(queryDesc->totaltime);

+	/* Start up per-query node level instrumentation */
+	res = InstrStartQuery();
+
/*
* extract information from the query descriptor and the query feature.
*/
@@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);

+	/* Shut down per-query node level instrumentation */
+	InstrShutdownQuery(res);
+
if (queryDesc->totaltime)
InstrStop(queryDesc->totaltime, estate->es_processed);

Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..7436f307994 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)

/* Stop the node if we started it above, reporting 0 tuples. */
if (node->instrument && node->instrument->running)
+ {
InstrStopNode(node->instrument, 0);

+		/*
+		 * Propagate WAL/buffer stats to the parent node on the
+		 * instrumentation stack (which is where InstrStopNode returned us
+		 * to).
+		 */
+		InstrNodeAddToCurrent(&node->instrument->stack);
+	}
+
return false;
}

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

/*
* If we know we won't need to back up, we can release resources at this
* point.
*/
if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
ExecShutdownNode(planstate);

+static void
+ResOwnerReleaseInstrStack(Datum res)
+{
+	/*
+	 * XXX: Registered resources are *not* called in reverse order, i.e. we'll
+	 * get what was first registered first at shutdown. To avoid handling
+	 * that, we are resetting the stack here on abort (instead of recovering
+	 * to previous).
+	 */
+	pgInstrStack = NULL;
+}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Greetings,

Andres Freund

Lukas Fittl

lukas@fittl.com

6 months ago

In reply to: Andres Freund (#2)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi Andres,

Thanks for the review!

Attached an updated patch set that addresses the feedback, and also adds
the complete removal of the global pgBufferUsage variable in later patches
(0005-0007), to avoid counting both the stack and the variable.

FWIW, pgWalUsage would also be nice to remove, but it has some interesting
interactions with the cumulative statistics system
and heap_page_prune_and_freeze, and seems less performance critical.

On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:

0001: Separate node instrumentation from other use of Instrumentation

struct
...
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).

Yeah, agreed. I added in a new 0001 patch that changes this to instr_time
consistently. I don't see a good reason to keep the transformation to
seconds in the Instrumentation logic, since all in-tree callers convert it
to milliseconds anyway.

0003: Introduce stack for tracking per-node WAL/buffer usage

Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.

Makes sense, I've adjusted that to be conditional (in the now renamed 0004
patch).

In the updated patch I've also decided to piggyback on QueryDesc totaltime
as the "owning" Instrumentation here for the query's lifetime. It seems
simpler that way and avoids having special purpose methods. To go along
with that I've changed the general purpose Instrumentation struct to use
stack-based instrumentation at the same time.

diff --git a/src/backend/executor/execProcnode.c
b/src/backend/executor/execProcnode.c
index d286471254b..7436f307994 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void
*context)
...

+ InstrNodeAddToCurrent(&node->instrument->stack);

...

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

You are of course correct, I didn't consider cursors correctly here.

It seems there isn't an existing executor node walk that could be
repurposed, so I added a new one in ExecutorFinish
("ExecAccumNodeInstrumentation"). From my read of the code there are no use
cases where we need aggregated instrumentation data before ExecutorFinish.

+static void
+ResOwnerReleaseInstrStack(Datum res)
+{
+     /*
+      * XXX: Registered resources are *not* called in reverse order,
i.e. we'll

+ * get what was first registered first at shutdown. To avoid

handling

+ * that, we are resetting the stack here on abort (instead of

recovering
+      * to previous).
+      */
+     pgInstrStack = NULL;
+}
Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally
triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Yes, good point, I did not consider that case. I've addressed this by only
updating the current stack if its not already a parent of the element being
released. We are also always adding the element's statistics to the
(updated) active stack element at abort.

Thanks,
Lukas

--
Lukas Fittl

Lukas Fittl

lukas@fittl.com

5 months ago

In reply to: Lukas Fittl (#3)

Re: Stack-based tracking of per-node WAL/buffer usage

On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:

Attached an updated patch set that addresses the feedback, and also adds
the complete removal of the global pgBufferUsage variable in later patches
(0005-0007), to avoid counting both the stack and the variable.

See attached the same patch set rebased on latest master.

Thanks,
Lukas

--
Lukas Fittl

Andres Freund

andres@anarazel.de

5 months ago

In reply to: Lukas Fittl (#4)

Re: Stack-based tracking of per-node WAL/buffer usage

On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:

On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:

Attached an updated patch set that addresses the feedback, and also adds
the complete removal of the global pgBufferUsage variable in later patches
(0005-0007), to avoid counting both the stack and the variable.

See attached the same patch set rebased on latest master.

From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
require caller to convert

Previously the Instrumentation logic always converted to seconds, only for many
of the callers to do unnecessary division to get to milliseconds. Since an upcoming
refactoring will split the Instrumentation struct, utilize instrtime always to
keep things simpler.

LGTM, think we should apply this regardless of the rest of the patches.

@@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->running = true;
dst->firsttuple = add->firsttuple;
}
-	else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+	else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
dst->firsttuple = add->firsttuple;

This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..646934020d1 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -184,6 +184,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_CMP_LT(x,y) \
+ ((x).ticks > (y).ticks)

#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1

Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.

From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
Instrumentation struct

Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This dual use of the struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time,
WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
outside of per-node instrumentation. Introduce the NodeInstrumentation
struct to carry forward the per-node instrumentation information.

@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);

- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)

Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.

+/*
+ * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured inbetween).
+ */
typedef struct Instrumentation
+{
+	/* Parameters set at creation: */
+	bool		need_timer;		/* true if we need timer data */
+	bool		need_bufusage;	/* true if we need buffer usage data */
+	bool		need_walusage;	/* true if we need WAL usage data */
+	/* Internal state keeping: */
+	instr_time	starttime;		/* start time of last InstrStart */
+	BufferUsage bufusage_start; /* buffer usage at start */
+	WalUsage	walusage_start; /* WAL usage at start */
+	/* Accumulated statistics: */
+	instr_time	total;			/* total runtime */
+	double		ntuples;		/* total tuples counted in InstrStop */
+	BufferUsage bufusage;		/* total buffer usage */
+	WalUsage	walusage;		/* total WAL usage */
+} Instrumentation;

Maybe add a comment explaining why ntuples is in here?

From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage

Could use a commit message :)

---
.../pg_stat_statements/pg_stat_statements.c | 4 +-
src/backend/commands/explain.c | 8 +-
src/backend/commands/trigger.c | 4 +-
src/backend/executor/execMain.c | 25 ++-
src/backend/executor/execProcnode.c | 29 +++
src/backend/executor/instrument.c | 199 ++++++++++++++----
src/include/executor/executor.h | 1 +
src/include/executor/instrument.h | 53 ++++-
8 files changed, 260 insertions(+), 63 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f43a33b3787..eeabd820d8e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
+				   &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
+				   &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?

@@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
-			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+		{
+			if ((instrument_options & INSTRUMENT_TIMER) != 0)
+				resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
+			else
+				resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
+		}
}
else
{

I'd not duplicate the InstrAlloc(), but compute the flags separately.

/* ------------------------------------------------------------------------
@@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation stack).
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+	(void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.

+static bool
+ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
+{
+	if (node == NULL)
+		return false;
+
+	check_stack_depth();
+
+	planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);

There already is a check_stack_depth() in planstate_tree_walker().

+	if (node->instrument && node->instrument->stack.previous)
+		InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+
+	return false;
+}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and adding a
comment.

+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+	.name = "instrumentation",
+	.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+	.release_priority = RELEASE_PRIO_FIRST,
+	.ReleaseResource = ResOwnerReleaseInstrumentation,
+	.DebugPrint = NULL,			/* default message is fine */
+};

Is there a reason to do the release here before the lock release? And why
_FIRST?

+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+	Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
+
+	/*
+	 * Because registered resources are *not* called in reverse order, we'll
+	 * get what was first registered first at shutdown. Thus, on any later
+	 * resources we need to not change the stack, which was already set to the
+	 * correct previous entry.
+	 */

FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.

+	if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
+		pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right? It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?

+	/*
+	 * Always accumulate all collected stats before the abort, even if we
+	 * already walked up the stack with an earlier resource.
+	 */
+	if (pgInstrStack)
+		InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...

/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
{
-	Instrumentation *instr;
+	Instrumentation *instr = NULL;
+	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+	bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+	int			i;
+
+	/*
+	 * If resource owner will be used, we must allocate in the transaction
+	 * context (not the calling context, usually a lower context), because the
+	 * memory might otherwise be freed too early in an abort situation.
+	 */
+	if (need_buffers || need_wal)
+		instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
+	else
+		instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?

-	/* initialize all fields to zeroes, then modify as needed */
-	instr = palloc0(n * sizeof(Instrumentation));
-	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+	for (i = 0; i < n; i++)
{
-		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
-		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
-		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
-		int			i;
-
-		for (i = 0; i < n; i++)
-		{
-			instr[i].need_bufusage = need_buffers;
-			instr[i].need_walusage = need_wal;
-			instr[i].need_timer = need_timer;
-		}
+		instr[i].need_bufusage = need_buffers;
+		instr[i].need_walusage = need_wal;
+		instr[i].need_timer = need_timer;
}

return instr;
}
+
void
InstrStart(Instrumentation *instr)
{
+ Assert(!instr->finalized);
+
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");

-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
-
-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPushStackResource(instr);
}
+
void
-InstrStop(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr, double nTuples, bool finalize)
{
instr_time	endtime;

@@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}

-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPopStackResource(instr);

-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+	if (finalize)
+	{
+		instr->finalized = true;
+		if (pgInstrStack)
+			InstrStackAdd(pgInstrStack, &instr->stack);
+	}
}

/* Allocate new node instrumentation structure(s) */
@@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStartNode called twice in a row");

-	/* save buffer usage totals at node entry, if needed */
-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		/* Ensure that we always have a parent, even at the top most node */
+		Assert(pgInstrStack != NULL);

-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+		instr->stack.previous = pgInstrStack;
+		pgInstrStack = &instr->stack;
+	}
}

/* Exit from a plan node */
@@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}

-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
-
-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		/* Ensure that we always have a parent, even at the top most node */
+		Assert(instr->stack.previous != NULL);
+		pgInstrStack = instr->stack.previous;
+	}

/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)

/* Add delta of buffer usage since entry to node's totals */
if (dst->need_bufusage)
-		BufferUsageAdd(&dst->bufusage, &add->bufusage);
+		BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);

if (dst->need_walusage)
-		WalUsageAdd(&dst->walusage, &add->walusage);
+		WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
+}
+
+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+	Assert(dst != NULL);
+	Assert(add != NULL);
+
+	BufferUsageAdd(&dst->bufusage, &add->bufusage);
+	WalUsageAdd(&dst->walusage, &add->walusage);
}

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...

/* note current values during parallel executor startup */
@@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
void
InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
{
+	if (pgInstrStack != NULL)
+	{
+		InstrStack *dst = pgInstrStack;
+
+		BufferUsageAdd(&dst->bufusage, bufusage);
+		WalUsageAdd(&dst->walusage, walusage);
+	}
+
BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}

Is the pgInstrStack == NULL case actually reachable?

diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 78d3653997b..d04607ce40c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
#define INSTRUMENT_H

#include "portability/instr_time.h"
+#include "utils/resowner.h"

I'd probably not include resowner here but just forward declare the typedef.

@@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);

+#define INSTR_GET_BUFUSAGE(instr) \
+	instr->stack.bufusage
+
+#define INSTR_GET_WALUSAGE(instr) \
+	instr->stack.walusage

Not convinced that having these macros is worthwhile.

At this point I reached return -ENEEDCOFFEE :)

Greetings,

Andres Freund

Lukas Fittl

lukas@fittl.com

4 months ago

In reply to: Andres Freund (#5)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi Andres,

Thanks for the detailed review!

Attached v4 patchset that addresses feedback (unless otherwise noted below)
and is rebased on master. Other changes:

- Ensured each patch is individually pgindent clean (and compiles)
- Refactored 0003 a bit to consistently use InstrPushStack/InstrPopStack
helpers for modifying the active stack entry
- Building on that refactoring, merged v3/0006 "Introduce alternate
Instrumentation stack mechanism relying on PG_FINALLY" into the main commit
introducing the stack mechanism (the "alternate" mechanism is just using
these helpers directly and making sure InstrPopStack is called via
PG_FINALLY, instead of using resource owners)
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
illustrates how the stack mechanism can be used to separate index and table
buffer accesses in the EXPLAIN for Index Scans

On Wed, Oct 22, 2025 at 5:59 AM Andres Freund <andres@anarazel.de> wrote:

+/*
+ * General purpose instrumentation that can capture time, WAL/buffer

usage and tuples

+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a

pair of

+ * InstrStart/InstrStop (activity is measured inbetween).
+ */
typedef struct Instrumentation
+{
+     /* Parameters set at creation: */
+     bool            need_timer;             /* true if we need timer

data */

+ bool need_bufusage; /* true if we need buffer usage

data */

+ bool need_walusage; /* true if we need WAL usage data

+     /* Internal state keeping: */
+     instr_time      starttime;              /* start time of last

InstrStart */

+     BufferUsage bufusage_start; /* buffer usage at start */
+     WalUsage        walusage_start; /* WAL usage at start */
+     /* Accumulated statistics: */
+     instr_time      total;                  /* total runtime */
+     double          ntuples;                /* total tuples counted in

InstrStop */

+     BufferUsage bufusage;           /* total buffer usage */
+     WalUsage        walusage;               /* total WAL usage */
+} Instrumentation;

Maybe add a comment explaining why ntuples is in here?

After thinking about this some more, I'd think we should just go ahead and
special case trigger instrumentation, and specifically count firings of the
trigger (which was counted in "ntuples" before).

I've adjusted the 0002 patch accordingly to split out both node and trigger
instrumentation.

---
.../pg_stat_statements/pg_stat_statements.c | 4 +-
src/backend/commands/explain.c | 8 +-
src/backend/commands/trigger.c | 4 +-
src/backend/executor/execMain.c | 25 ++-
src/backend/executor/execProcnode.c | 29 +++
src/backend/executor/instrument.c | 199 ++++++++++++++----
src/include/executor/executor.h | 1 +
src/include/executor/instrument.h | 53 ++++-
8 files changed, 260 insertions(+), 63 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
b/contrib/pg_stat_statements/pg_stat_statements.c
index f43a33b3787..eeabd820d8e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
-                                &queryDesc->totaltime->bufusage,
-                                &queryDesc->totaltime->walusage,
+
&INSTR_GET_BUFUSAGE(queryDesc->totaltime),

+

&INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?

I've dropped the macro (its just one additional indirection after all) - I
think we could refactor this further (i.e. pass the stack), but that
doesn't seem strictly necessary.

/*
------------------------------------------------------------------------

@@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void

*context)
return false;
}
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their
respective
+ * parents (or the original parent instrumentation stack).
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+     (void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}
I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just
rename
it at that later point.

Yeah, I can't think of a better name, so I've left this the same for now.

+     if (node->instrument && node->instrument->stack.previous)
+             InstrStackAdd(node->instrument->stack.previous,
&node->instrument->stack);
+
+     return false;
+}
E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then
also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and
adding a
comment.

I went through and double checked - we don't need to special case these
node types in my understanding, and I couldn't see any specific cases where
intermediary nodes would be removed in shutdown either (o.e.
causing dropped stats since a node was removed with its associated stats
not being added to the parent yet).

I added a comment to make it clear this must run after ExecShutdownNode.

+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+     .name = "instrumentation",
+     .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+     .release_priority = RELEASE_PRIO_FIRST,
+     .ReleaseResource = ResOwnerReleaseInstrumentation,
+     .DebugPrint = NULL,                     /* default message is fine

+};

Is there a reason to do the release here before the lock release? And why
_FIRST?

Adjusted to have its own phase, and after lock release.

+     if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
+             pgInstrStack = instr->stack.previous;
Hm - this is effectively O(stack-depth^2), right? It's probably fine,
given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?

Yeah, I added a comment - I don't see a case where this is a bottleneck
today given the limited nesting of resource owner using stacks.

+     /*
+      * Always accumulate all collected stats before the abort, even if
we
+      * already walked up the stack with an earlier resource.
+      */
+     if (pgInstrStack)
+             InstrStackAdd(pgInstrStack, &instr->stack);
Why are we accumulating stats in case of errors? It's probably fine, but
doing
less as part of cleanup is pre ferrable...

In my understanding, we need to do this in case of functions called in a
query that catch a rollback/error, since we'd otherwise not account for
that function's activity as part of the top-level query.

/* General purpose instrumentation handling */

Instrumentation *
InstrAlloc(int n, int instrument_options)
{

...
+     if (need_buffers || need_wal)
+             instr = MemoryContextAllocZero(CurTransactionContext, n *
sizeof(Instrumentation));
+     else
+             instr = palloc0(n * sizeof(Instrumentation));
Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?

I ended up refactoring this a bit, since it seemed useful to do an explicit
pfree at InstrStop or when aborting, both to avoid leaks, and to
theoretically support using TopMemoryContext here.

That said, from my testing I think CurTransactionContext is sufficient,
because we just need something that lives long enough during resource owner
abort situations (e.g. per-query context doesn't work, since the abort
frees it before we do our resource owner handling).

The pgss+CIC case isn't relevant here (I think), because utility statements
don't use the resource owner mechanism at all (with the exception of
EXPLAIN which calls into ExecutorStart), instead we use a PG_TRY/PG_FINALLY
in pg_stat_statements to pop the stack.

+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+     Assert(dst != NULL);
+     Assert(add != NULL);
+
+     BufferUsageAdd(&dst->bufusage, &add->bufusage);
+     WalUsageAdd(&dst->walusage, &add->walusage);
}
Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...

I briefly considered whether we could add this to the InstrStack itself
(i.e. whether we actually care about buffer, WAL usage or both), but I
think where it gets messy is that we can have indirect requirements to
track this - you might have pg_stat_statements capturing both, but e.g. the
utility statement being executed only caring about emitting WAL usage in
the log.

I'm also not familiar with an in-core use case today where only WAL (but
not buffers) is needed, short of doing something like "EXPLAIN (ANALYZE,
BUFFERS OFF, WAL ON)" without having pg_stat_statements or similar enabled.
Do you have a specific example where this could help?

/* note current values during parallel executor startup */

@@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage,

WalUsage *walusage)
void
InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
{
+     if (pgInstrStack != NULL)
+     {
+             InstrStack *dst = pgInstrStack;
+
+             BufferUsageAdd(&dst->bufusage, bufusage);
+             WalUsageAdd(&dst->walusage, walusage);
+     }
+
BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
Is the pgInstrStack == NULL case actually reachable?

In my reading of the code, its necessary because we unconditionally track
WAL/buffer usage in parallel workers, even if the leader doesn't actually
need it.

We could be smarter about this (i.e. tell the workers not to collect the
information in the first place), but for now it seemed easiest to just
discard it.

Thanks,
Lukas

--
Lukas Fittl

Andres Freund

andres@anarazel.de

about 2 months ago

In reply to: Lukas Fittl (#6)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi,

On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:

Attached v4 patchset that addresses feedback (unless otherwise noted below)
and is rebased on master. Other changes:

[...]
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
illustrates how the stack mechanism can be used to separate index and table
buffer accesses in the EXPLAIN for Index Scans

Nice!

I pushed 0001. The only changes I made were to break a few long lines.

From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
use of Instrumentation struct

Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.

Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information, and introduce TriggerInstrumentation to
capture trigger timing and firings (previously counted in "ntuples").

Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object.  This will depend on the length of
+ * of the first NodeInstrumentation object.  This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
int			num_workers;
int			num_plan_nodes;
int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
-	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+	/*
+	 * array of num_plan_nodes * num_workers NodeInstrumentation objects
+	 * follows
+	 */
};

Spurious change, I think?

+
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+	TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+	bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+	int			i;
+
+	for (i = 0; i < n; i++)
+	{
+		tginstr[i].instr.need_bufusage = need_buffers;
+		tginstr[i].instr.need_walusage = need_wal;
+		tginstr[i].instr.need_timer = need_timer;
+	}
+
+	return tginstr;

Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?

+/* Trigger instrumentation */
+typedef struct TriggerInstrumentation
+{
+	Instrumentation instr;
+	int			firings;		/* number of times the instrumented trigger
+								 * was fired */
+}			TriggerInstrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
{
/* Parameters set at node creation: */
bool		need_timer;		/* true if we need timer data */
@@ -92,25 +125,34 @@ typedef struct Instrumentation
double		nfiltered2;		/* # of tuples removed by "other" quals */
BufferUsage bufusage;		/* total buffer usage */
WalUsage	walusage;		/* total WAL usage */
-} Instrumentation;
+}			NodeInstrumentation;

The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list

typedef struct WorkerInstrumentation
{
int			num_workers;	/* # of structures that follow */
-	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+	NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?

From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
with INSTR_* macros

Looks sane to me.

I'd move that to the head of the queue in the next revision.

From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
stack-based instrumentation

Previously, in order to determine the buffer/WAL usage of a given code section,
we utilized continuously incrementing global counters that get updated when the
actual activity (e.g. shared block read) occurred, and then calculated a diff when
the code section ended. This resulted in a bottleneck for executor node instrumentation
specifically, with the function BufferUsageAccumDiff showing up in profiles and
in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.

Instead, introduce a stack-based mechanism, where the actual activity writes
into the current stack entry. In the case of executor nodes, this means that
each node gets its own stack entry that is pushed at InstrStartNode, and popped
at InstrEndNode. Stack entries are zero initialized (avoiding the diff mechanism)
and get added to their parent entry when they are finalized, i.e. no more
modifications can occur.

To correctly handle abort situations, any use of instrumentation stacks must
involve either a top-level Instrumentation struct, and its associated InstrStart/
InstrStop helpers (which use resource owners to handle aborts), or dedicated
PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after
an abort.

I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.

@@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
+
+		/*
+		 * Check if stack is initialized - it is not when ExecutorRun wasn't
+		 * called
+		 */
+				   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
+				   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,

Maybe it's just the diff, but this looks a bit odd, indentation wise.

@@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
-			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+		{
+			/*
+			 * Triggers do not individually track buffer/WAL usage, even if
+			 * otherwise tracked
+			 */

Why is that?

@@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");

-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		Assert(CurrentResourceOwner != NULL);
+		instr->owner = CurrentResourceOwner;
+
+		/*
+		 * Allocate the stack resource in a memory context that survives
+		 * during an abort. This will be freed by InstrStop (regular
+		 * execution) or ResOwnerReleaseInstrumentation (abort).
+		 *
+		 * We don't do this in InstrAlloc to avoid leaking when InstrStart +
+		 * InstrStop isn't called.
+		 */
+		if (instr->stack == NULL)
+			instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));

-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+		ResourceOwnerEnlarge(instr->owner);
+		ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
+
+		InstrPushStack(instr->stack);
+	}
}

I'm still worried that allocating something in CurTransactionContext will bite
us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.

void
-InstrStop(Instrumentation *instr)
+InstrStop(Instrumentation *instr, bool finalize)
{
instr_time	endtime;

@@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}

-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		InstrPopStack(instr->stack, finalize);

-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+		Assert(instr->owner != NULL);
+		ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
+		instr->owner = NULL;
+
+		if (finalize)
+		{
+			/*
+			 * To avoid keeping memory allocated beyond when its needed, copy
+			 * the result to the current memory context, and free it in the
+			 * transaction context.
+			 */
+			InstrStack *stack = palloc(sizeof(InstrStack));
+
+			memcpy(stack, instr->stack, sizeof(InstrStack));
+			pfree(instr->stack);
+			instr->stack = stack;
+		}
+	}
}

Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?

/* Trigger instrumentation handling */
@@ -98,15 +194,20 @@ TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
{
TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
-	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
-	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int			i;

+	/*
+	 * To avoid having to determine when the last trigger fired, we never
+	 * track WAL/buffer usage for now
+	 */
+	Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
+	Assert((instrument_options & INSTRUMENT_WAL) == 0);
+
for (i = 0; i < n; i++)
{
-		tginstr[i].instr.need_bufusage = need_buffers;
-		tginstr[i].instr.need_walusage = need_wal;
+		tginstr[i].instr.need_bufusage = false;
+		tginstr[i].instr.need_walusage = false;
tginstr[i].instr.need_timer = need_timer;
}

That does seem problematic to me.

@@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
}

/* start instrumentation during parallel executor startup */
-void
+Instrumentation *
InstrStartParallelQuery(void)
{
-	save_pgBufferUsage = pgBufferUsage;
-	save_pgWalUsage = pgWalUsage;
+	Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+	InstrStart(instr);
+	return instr;
}

/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
{
-	memset(bufusage, 0, sizeof(BufferUsage));
-	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
-	memset(walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+	InstrStop(instr, true);
+	memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
+	memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
}

Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...

if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long		secs_dur;
int			usecs_dur;
-			WalUsage	walusage;
-			BufferUsage bufferusage;
StringInfoData buf;
char	   *msgfmt;
int32		diff;
@@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64		total_blks_hit;
int64		total_blks_read;
int64		total_blks_dirtied;
+			BufferUsage bufusage = instr->stack->bufusage;
+			WalUsage	walusage = instr->stack->walusage;

TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
-			memset(&walusage, 0, sizeof(WalUsage));
-			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-			memset(&bufferusage, 0, sizeof(BufferUsage));
-			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-
-			total_blks_hit = bufferusage.shared_blks_hit +
-				bufferusage.local_blks_hit;
-			total_blks_read = bufferusage.shared_blks_read +
-				bufferusage.local_blks_read;
-			total_blks_dirtied = bufferusage.shared_blks_dirtied +
-				bufferusage.local_blks_dirtied;
+
+			total_blks_hit = bufusage.shared_blks_hit +
+				bufusage.local_blks_hit;
+			total_blks_read = bufusage.shared_blks_read +
+				bufusage.local_blks_read;
+			total_blks_dirtied = bufusage.shared_blks_dirtied +
+				bufusage.local_blks_dirtied;

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?

Kinda going the other way: Why copy this on the stack, rather than use a
pointer?

diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..77b4c59e71c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell   *p;
ParamListInfo paramLI = NULL;
EState	   *estate = NULL;
-	instr_time	planstart;
-	instr_time	planduration;
-	BufferUsage bufusage_start,
-				bufusage;
+	Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;

+	if (es->buffers)
+		instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+	else
+		instr = InstrAlloc(1, INSTRUMENT_TIMER);

I'd not duplicate the call to InstrAlloc().

From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Thu, 30 Oct 2025 22:27:30 -0700
Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
reporting in EXPLAIN

This makes it clear whether activity was on the index directly, or
on the table based on heap fetches.

Yay^2.

---
src/backend/commands/explain.c | 56 ++++++++++++++++------------
src/backend/executor/execProcnode.c | 13 +++++++
src/backend/executor/instrument.c | 25 +++++++++++++
src/backend/executor/nodeIndexscan.c | 15 +++++++-
src/include/access/genam.h | 3 ++
src/include/executor/instrument.h | 3 ++
6 files changed, 91 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dd3bf615581..fb96dd5248c 100644

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d00cf820a27..f19af428d97 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);

if (node->instrument && node->instrument->stack.previous)
+ {

Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.

+		/*
+		 * Index Scan nodes account for heap buffer usage separately, so we
+		 * need to explitly add here
+		 */

s/heap/table/

Greetings,

Andres Freund

Lukas Fittl

lukas@fittl.com

about 2 months ago

In reply to: Andres Freund (#7)

Re: Stack-based tracking of per-node WAL/buffer usage

On Fri, Jan 9, 2026 at 11:38 AM Andres Freund <andres@anarazel.de> wrote:

I pushed 0001. The only changes I made were to break a few long lines.

Thanks!

Attached v5, feedback addressed unless otherwise noted, with the patch
adding macros to access pgBufferUsage/pgWalUsage (was v4/0003) to the
front per your request.

Its also worth noting that 2defd0006255 just added a new
"instrument_node.h". We could consider moving the per-node
instrumentation structs and associated functions there (but probably
keep a single "instrument.c" unit file?). Thoughts?

Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?

Yep, makes sense, done like that. I also did a code search for any
extension use cases and couldn't find anything that'd need multiples
of the Instrumentation struct (either the new general purpose one, or
the per-node one).

+++ b/src/backend/executor/execParallel.c
...
-     /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+     /*
+      * array of num_plan_nodes * num_workers NodeInstrumentation objects
+      * follows
+      */
};

Spurious change, I think?

pgindent forces the line break here, because of the Instrumentation =>
NodeInstrumentation change.

Are you saying we shouldn't make that wording change? then I'd suggest
we lower case the "i" of "instrumentation", so it doesn't get confused
with the struct of that name.

typedef struct WorkerInstrumentation
{
int                     num_workers;    /* # of structures that follow */
-     Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+     NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?

Yeah, the name feels a bit lengthy, but given that I found some of the
parallel query instrumentation logic confusing, erring on the side of
being more explicit is probably good here. Done that way.

Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
stack-based instrumentation

I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.

Yeah, that makes sense - I ran out of time to add this now (and I feel
there are other things worth discussing in the meantime), but will
work on this.

@@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
-                     resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+             {
+                     /*
+                      * Triggers do not individually track buffer/WAL usage, even if
+                      * otherwise tracked
+                      */

Why is that?

I originally special cased triggers (i.e. didn't permit trigger
instrumentation to track WAL/buffer usage) because the current code
does not need it, and it simplified the logic, because we always need
to add child stack information to the parent so that totals are
correct, e.g. for pg_stat_statements.

But, I see your point that this isn't great, and we may want to emit
WAL/buffer usage for triggers in EXPLAIN in the future. To handle this
I adjusted this so we now also finalize the per-trigger
instrumentation in ExecutorFinish, like we do it for per-node
instrumentation.

@@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");

-     if (instr->need_bufusage)
-             instr->bufusage_start = pgBufferUsage;
+     if (instr->need_bufusage || instr->need_walusage)
+     {
+             Assert(CurrentResourceOwner != NULL);
+             instr->owner = CurrentResourceOwner;
+
+             /*
+              * Allocate the stack resource in a memory context that survives
+              * during an abort. This will be freed by InstrStop (regular
+              * execution) or ResOwnerReleaseInstrumentation (abort).
+              *
+              * We don't do this in InstrAlloc to avoid leaking when InstrStart +
+              * InstrStop isn't called.
+              */
+             if (instr->stack == NULL)
+                     instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));

-     if (instr->need_walusage)
-             instr->walusage_start = pgWalUsage;
+             ResourceOwnerEnlarge(instr->owner);
+             ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
+
+             InstrPushStack(instr->stack);
+     }
}

Hmm, I'll have to spend more time thinking through that future
procedure instrumentation case and how it interacts with resource
owners.

Just to state it, the main point of using CurTransactionContext
(instead of CurrentMemoryContext) here is that we handle an abort
situation that may propagate above the function that called
InstrStart. We need to make sure memory survives until the resource
owner gets cleaned up (either by aborting, or by InstrStop being
called with finalize = true), so we can add the stack to its parent.

I'm happy to adjust this to TopTransactionContext if that's what you
had in mind?

Or maybe I'm missing some other way to get a memory context that
relates to the current resource owner? (i.e. survives at least until
the resource owner gets released)

void
-InstrStop(Instrumentation *instr)
+InstrStop(Instrumentation *instr, bool finalize)
{
instr_time      endtime;

@@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}

-     /* Add delta of buffer usage since entry to node's totals */
-     if (instr->need_bufusage)
-             BufferUsageAccumDiff(&instr->bufusage,
-                                                      &pgBufferUsage, &instr->bufusage_start);
+     if (instr->need_bufusage || instr->need_walusage)
+     {
+             InstrPopStack(instr->stack, finalize);

-     if (instr->need_walusage)
-             WalUsageAccumDiff(&instr->walusage,
-                                               &pgWalUsage, &instr->walusage_start);
+             Assert(instr->owner != NULL);
+             ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
+             instr->owner = NULL;
+
+             if (finalize)
+             {
+                     /*
+                      * To avoid keeping memory allocated beyond when its needed, copy
+                      * the result to the current memory context, and free it in the
+                      * transaction context.
+                      */
+                     InstrStack *stack = palloc(sizeof(InstrStack));
+
+                     memcpy(stack, instr->stack, sizeof(InstrStack));
+                     pfree(instr->stack);
+                     instr->stack = stack;
+             }
+     }
}

Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?

This is mainly about making the caller to InstrStop take
responsibility for freeing the memory allocation when its current
context gets freed, which will potentially be before
CurTransactionContext (or whichever context we chose) gets freed. If
we kept it around and e.g. this was a long transaction with many
queries executed, we'd keep consuming more and more memory otherwise.

Or looking at it differently, the principle here is, if you've reached
InstrStop (with finalize=true), no aborts involving this stack can
occur anymore, and so all that's left is for the caller to utilize the
instrumentation data as it sees fit, e.g. emitting it to the logs,
etc.

/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...

Good point, why not, added pg_nodiscard - seems like cheap insurance
to help notice someone they're calling it incorrectly.

@@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64           total_blks_hit;
int64           total_blks_read;
int64           total_blks_dirtied;
+                     BufferUsage bufusage = instr->stack->bufusage;
+                     WalUsage        walusage = instr->stack->walusage;
...
-                     total_blks_dirtied = bufferusage.shared_blks_dirtied +
-                             bufferusage.local_blks_dirtied;
+
+                     total_blks_hit = bufusage.shared_blks_hit +

...

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?

Kinda going the other way: Why copy this on the stack, rather than use a
pointer?

For now to keep the diff smaller, I went the way of not renaming the variable.

I don't think efficiency matters much in that code path (i.e. the copy
of ~150 bytes seems fine to me), but happy to adjust if you think a
pointer is better.

Thanks,
Lukas

--
Lukas Fittl

Peter Smith

smithpb2250@gmail.com

about 2 months ago

In reply to: Andres Freund (#7)

Re: Stack-based tracking of per-node WAL/buffer usage

On Sat, Jan 10, 2026 at 6:38 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:

Attached v4 patchset that addresses feedback (unless otherwise noted below)
and is rebased on master. Other changes:

[...]
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
illustrates how the stack mechanism can be used to separate index and table
buffer accesses in the EXPLAIN for Index Scans

Nice!

I pushed 0001. The only changes I made were to break a few long lines.

I happened to be reading the code in this recent push [1]https://github.com/postgres/postgres/commit/e5a5e0a90750d665cab417322b9f85c806430d85#diff-d927962e2574e803c27ea9a429eeecf7bc29c7a38c830ccb3a10e9e3da5ba357R187 and saw this
new macro:

+#define INSTR_TIME_LT(x,y) \
+ ((x).ticks > (y).ticks)

Is that macro name OK? It seemed backwards to me. Shouldn't it be
called INSTR_TIME_GT because it is checking that x is "Greater Than"
y?

======
[1]: https://github.com/postgres/postgres/commit/e5a5e0a90750d665cab417322b9f85c806430d85#diff-d927962e2574e803c27ea9a429eeecf7bc29c7a38c830ccb3a10e9e3da5ba357R187

Kind Regards.
Peter Smith.
Fujitsu Australia

#10

Lukas Fittl

lukas@fittl.com

about 2 months ago

In reply to: Peter Smith (#9)

Re: Stack-based tracking of per-node WAL/buffer usage

On Thu, Jan 15, 2026 at 2:06 PM Peter Smith <smithpb2250@gmail.com> wrote:

I happened to be reading the code in this recent push [1] and saw this
new macro:

+#define INSTR_TIME_LT(x,y) \
+ ((x).ticks > (y).ticks)

Is that macro name OK? It seemed backwards to me. Shouldn't it be
called INSTR_TIME_GT because it is checking that x is "Greater Than"
y?

Oh yeah, good catch. I think I must have thought of "larger than"
instead of "less than".

I think adjusting that to INSTR_TIME_GT makes sense, and is consistent
with how "lt" and "gt" is used elsewhere in the source.

Thanks,
Lukas

--
Lukas Fittl

#11

Lukas Fittl

lukas@fittl.com

14 days ago

In reply to: Lukas Fittl (#10)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi,

Attached v6:

0001 addresses the issue that Peter raised and corrects the macro name
+ adds a missing comment.

0002/0003 are the same as in v5 and are preparatory commits that
should be in good shape to go.

0004 adds new regression tests that pass on both master and the
patched version, and stress test some of the nested instrumentation
handling.

0005 is what was previously 0003, and introduces the stack-based
instrumentation mechanism. I've made several changes here:

- Rename pgInstrStack => CurrentInstrStack
- Always initialize a top-level stack (TopInstrStack) to avoid
condition when writing to stack
- Rename ExecAccumNodeInstrumentation to
ExecFinalizeNodeInstrumentation, and skip it for parallel workers to
avoid double counting (covered by new test cases). The new
InstrFinalizeNode method that accompanies this now also copies the
per-node instrumentation into the per-query context when in the
regular non-abort case. I think this is necessary when we have a
long-running transaction with auto_explain enabled and many fast
statements, so we don't want to keep NodeInstrumentation until
transaction end (since each NodeInstrumentation is ~288 bytes).
- Accumulate unfinalized per-node stacks on abort (so we can rely on
accumulated totals to be correct) - note that this requires an
additional tree walk in the first ExecutorRun because we lazily
initialize the query->totaltime stack
- Moved the non-node instr stack allocations to TopMemoryContext -
after further testing and review I agree that's necessary because of
utility statements. The specific test case that showed that was
pgss_ProcessUtility and how it behaves with the new utility.sql test
case, as well as VACUUM commands in stream regress.
- Combined parallel worker changes into this commit (it didn't really
make sense to keep these separate based on some of the edge cases)
- Explicit handling of orphaned stack entries when we experience
out-of-order cleanup during abort
- Added a lot more comments (too many?) to explain different edge cases.

0006 could probably be merged into 0005, and is the existing
mechanical change to drop the pgBufferUsage global, but kept separate
for easier review for now.

0007 is the stack-mechanism for splitting out "Table Buffers" for
Index Scans. This has been changed to:

- Report combined index/table buffers as "Buffers" and only show
"Table Buffers" broken out. This was suggested off-list by Andres, and
I've realized this is also otherwise confusing when there are child
nodes below that accumulate up.
- Fix a bug relating to parallel query reporting where previously the
table stack contents would not be copied from parallel workers to the
leader.
- Add documentation and update EXPLAIN ANALYZE examples that reference
Index Scan nodes.

0008 adds a pg_session_buffer_usage() module for testing the global
counters. This helps to verify the handling of aborts behave sanely
(and can run on both master and patched), but I don't think we should
commit this.

---

Two questions on my mind:

1) Should we call this "instrumentation context" (InstrContext?)
instead of "instrumentation stack" (InstrStack)? When writing comments
I found the difference between "stack entry" and "stack" confusing at
times, and "context" feels a bit more clear as a term (e.g. as in
"CurrentInstrContext" or "CurrentInstrumentationContext").

2) For 0007, "Table Buffers" feels a bit inconsistent with "Heap
Fetches" and "Heap Blocks" used elsewhere, should we potentially use
"Heap Buffers", or change the existing names to "Table ..." instead?

And one follow-up on a prior note:

On Tue, Jan 13, 2026 at 12:01 AM Lukas Fittl <lukas@fittl.com> wrote:

Its also worth noting that 2defd0006255 just added a new
"instrument_node.h". We could consider moving the per-node
instrumentation structs and associated functions there (but probably
keep a single "instrument.c" unit file?). Thoughts?

I've consider this again and think that's not a good idea, because
instrument_node.h seems specifically focused on parallel query - so
I've left all new methods/structs in instrument.h for now.

Thanks,
Lukas

--
Lukas Fittl

#12

Lukas Fittl

lukas@fittl.com

2 days ago

In reply to: Lukas Fittl (#11)

Re: Stack-based tracking of per-node WAL/buffer usage

On Mon, Feb 23, 2026 at 8:18 PM Lukas Fittl <lukas@fittl.com> wrote:

0001 addresses the issue that Peter raised and corrects the macro name
+ adds a missing comment.

This has been pushed by Andres, thanks again Peter for noting this!

See attached v7, with the changes as noted later. But first, some
fresh performance numbers that take into account extra inlining added
in a new v7/0006:

Example (default shared_buffers, runtimes are best out of 3-ish):

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

250ms actual runtime (no instrumentation)

BUFFERS OFF, TIMING OFF:
295ms master
295ms with stack-based instrumentation only (v7/0005) -- no change
because BUFFERS OFF
260ms with ExecProcNodeInstr inlining work (v7/0006)

BUFFERS ON, TIMING OFF:
380ms master
305ms with stack-based instrumentation only (v7/0005)
280ms with ExecProcNodeInstr inlining work (v7/0006)

In summary: For BUFFERS ON, we're going from 52% overhead in this
stress test, to 12% overhead (22% without the ExecProcNodeInstr
change). With rows instrumentation only, we go from 18% to 3%
overhead.

0002/0003 are the same as in v5 and are preparatory commits that
should be in good shape to go.

I split out the trigger instrumentation refactoring into its own
commit (0001) per Andres' request off-list.

0002 is the node instrumentation refactoring, which changed slightly,
because Instrumentation is now treated as a base struct and
NodeInstrumentation now contains an Instrumentation "instr" field to
reduce duplication.

0003 replaces pgBufferUsage/pgWalUsage with INSTR_* macros, pretty
much unchanged from v6/0002.

0004 adds new regression tests that pass on both master and the
patched version, and stress test some of the nested instrumentation
handling.

This patch is the same in v7 as in v6.

0005 is what was previously 0003, and introduces the stack-based
instrumentation mechanism.

0005 is still the stack-based instrumentation commit, but has several
improvements over v6:

- Andres suggested a different stack structure in a conversation
off-list, where we track the stack entries in a separately kept array
instead of with inline pointers on the Instrumentation structure -
that ends up making the whole change a lot easier to reason about, and
clarifies the abort handling as well
- I've separated out the ResOwner handling into a new
"QueryInstrumentation" struct - this makes it clearer that one can use
Instrumentation directly with certain caveats (i.e. one must deal with
aborts directly), and feels better now that we're including
Instrumentation in NodeInstrumentation as well

0006 could probably be merged into 0005, and is the existing
mechanical change to drop the pgBufferUsage global, but kept separate
for easier review for now.

I've merged that together now into 0005 because it feels better to
reason about this as a whole and drop pgBufferUsage within the same
commit where we're adding the stack-based instrumentation.

There is now a new 0006 patch that further optimizes ExecProcNodeInstr
(with performance numbers as noted in the beginning of this mail).
During my testing I realized that the conditional checks in
InstrStartNode/InstrStopNode are quite unnecessary - we never change
what gets instrumented during execution, and so we can create
specialized functions for the different instrumentation combinations,
giving the compiler a much better chance at generating optimal
instructions. The assembly here looks a lot better now.

0007 is the stack-mechanism for splitting out "Table Buffers" for
Index Scans.

This stayed pretty much the same from the earlier version, but I
reworked this a bit to avoid special Instr functions for dealing with
this case, instead it re-uses NodeInstrumentation and
InstrStartNode/InstrStopNode.

0008 adds a pg_session_buffer_usage() module for testing the global
counters. This helps to verify the handling of aborts behave sanely
(and can run on both master and patched), but I don't think we should
commit this.

This is identical with v6/0008, but continues to prove very useful in
testing the refactorings.

Two questions on my mind:

1) Should we call this "instrumentation context" (InstrContext?)
instead of "instrumentation stack" (InstrStack)? When writing comments
I found the difference between "stack entry" and "stack" confusing at
times, and "context" feels a bit more clear as a term (e.g. as in
"CurrentInstrContext" or "CurrentInstrumentationContext").

With the refactoring of the stack-mechanism done now, I think we don't
need to change the naming here, since "InstrStack" as a commonly used
struct is gone now (its just Instrumentation now), so I'd suggest
keeping the "stack" terminology to describe the approach itself.

2) For 0007, "Table Buffers" feels a bit inconsistent with "Heap
Fetches" and "Heap Blocks" used elsewhere, should we potentially use
"Heap Buffers", or change the existing names to "Table ..." instead?

That is still an open question.

Thanks,
Lukas

--
Lukas Fittl

#13

Lukas Fittl

lukas@fittl.com

2 days ago

In reply to: Lukas Fittl (#12)

Re: Stack-based tracking of per-node WAL/buffer usage

On Sat, Mar 7, 2026 at 8:27 PM Lukas Fittl <lukas@fittl.com> wrote:

Example (default shared_buffers, runtimes are best out of 3-ish):

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

250ms actual runtime (no instrumentation)

BUFFERS OFF, TIMING OFF:
295ms master
295ms with stack-based instrumentation only (v7/0005) -- no change
because BUFFERS OFF
260ms with ExecProcNodeInstr inlining work (v7/0006)

BUFFERS ON, TIMING OFF:
380ms master
305ms with stack-based instrumentation only (v7/0005)
280ms with ExecProcNodeInstr inlining work (v7/0006)

In summary: For BUFFERS ON, we're going from 52% overhead in this
stress test, to 12% overhead (22% without the ExecProcNodeInstr
change). With rows instrumentation only, we go from 18% to 3%
overhead.

Erm, and I forgot the query here, this is testing "SELECT count(*)
FROM lotsarows;", just like over in [0]/messages/by-id/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de.

Thanks,
Lukas

[0]: /messages/by-id/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de

--
Lukas Fittl

#14

Zsolt Parragi

zsolt.parragi@percona.com

about 16 hours ago

In reply to: Lukas Fittl (#12)

Re: Stack-based tracking of per-node WAL/buffer usage

Hello

+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
  if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);

In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the
original instrumentation, but doesn't remove it from the
unfinalized_children list. In normal execution in
InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles
this, but what about the error path, if something happens between the
two?
Won't we end up in ResOwnerReleaseInstrumentation and do use after
free reads and then a double free on the now invalid pointer?

+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+

Is it okay to leak 1 instrumentation copy per tuple in the query
context? This freshly palloced object will be out of scope a few lines
after this.

@@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)

IndexNextWithReorder doesn't need the same handling?

+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");

Shouldn't this say index scans? (there's another preexisting
indexonylscan mention in this file, but that also seems wrong)

+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif

I don't see pgBufferUsage anywhere else in the current code, it was
probably removed between rebases?

+ char    *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+ ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
     usage->shared_blks_hit, es);

(And many similar ExplainPropery after this)

title is NULL most of the time, and this results in 16 allocations for
that common case - isn't there a better solution like using
ExplainOpenGroup or something?

+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, instead call InstrPopAndFinalizeStack which can skip intermediate
+ * stack entries, or instead use InstrStart/InstrStop.

InstrPopAndFinalizeStack doesn't exists in the latest patch version

#15

Lukas Fittl

lukas@fittl.com

about 15 hours ago

In reply to: Zsolt Parragi (#14)

Re: Stack-based tracking of per-node WAL/buffer usage

Hi Zsolt,

Thanks for reviewing!

On Mon, Mar 9, 2026 at 2:55 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote:

+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the
original instrumentation, but doesn't remove it from the
unfinalized_children list. In normal execution in
InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles
this, but what about the error path, if something happens between the
two?
Won't we end up in ResOwnerReleaseInstrumentation and do use after
free reads and then a double free on the now invalid pointer?

ResourceOwnerForgetInstrumentation directly follows the call to
ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
sure which error case you're thinking of?

We could explicitly zap the unfinalized_children list at the end of
ExecFinalizeNodeInstrumentation (and have it take a
QueryInstrumentation argument) to protect against this, but I don't
think that makes much of a difference with the code as currently
written.

Maybe I'm misunderstanding which situation you're thinking of?

+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+
Is it okay to leak 1 instrumentation copy per tuple in the query
context? This freshly palloced object will be out of scope a few lines
after this.

I don't think that's a permanent leak, since it would be in the memory
context of the caller, i.e. the per-query memory context, but yeah,
this doesn't seem ideal, since we're keeping this memory around for
each tuple processed.

First of all, we could just do a pfree in serializeAnalyzeReceive
after we added the stats, but that said, this extra instrumentation
for EXPLAIN (SERIALIZE) is a bit of a curious edge case I suppose,
since serializeAnalyzeReceive gets called once per tuple - and so
doing the ResOwner dance for each tuple is probably not ideal.

I wonder if maybe we shouldn't treat this as a case of
NodeInstrumentation instead, and attach to the query's totaltime
instrumentation for abort safety. The only thing that's a bit
inconvenient about that is that calling
InstrQueryRememberNode/InstrFinalizeNode requires passing in that
parent QueryInstrumentation, and the SerializeDestReceiver doesn't
currently have easy access to that. If we brute-force solve that we
could just add an extra method to set it after CreateQueryDesc is
called, but:

Stepping back, from an overall API design perspective, we could also
track the current QueryInstrumentation in a global variable - that'd
make it easier to attach/finalize extra instrumentation on it even if
we don't have direct access to querydesc->totaltime, or we're in a
non-executor context that uses QueryInstrumentation (like some utility
commands) for future use cases of extra sub-query-level
instrumentation. For example, that could also come in handy if we
wanted VACUUM VERBOSE break out the buffer access it does on indexes
vs tables.

@@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)

IndexNextWithReorder doesn't need the same handling?

Yes - good point, that's an unintentional omission.

For context, I hadn't spent substantial amounts of time on this part
of the series (rather on getting the design of the stack mechanism
right), but will fix this in the next revision to also handle
IndexNextWithReorder.

I've also been thinking whether we should teach Index-only Scans to
break out the per-table buffer stats for the (rare) heap fetches. I
guess we should? (it'd be fairly easy to instrument that
index_fetch_heap there now)

+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
Shouldn't this say index scans? (there's another preexisting
indexonylscan mention in this file, but that also seems wrong)

Hmm. Looking at this again, I think the handling of xs_heap_continue
isn't right altogether. Basically it should be this instead, I think,
so we correctly call the table AM's table_index_fetch_tuple again if
call_again gets set:

while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
if (node->iss_InstrumentTable)
InstrPushStack(&node->iss_InstrumentTable->instr);

for (;;)
{
found = index_fetch_heap(scandesc, slot);
if (found || !scandesc->xs_heap_continue)
break;
}

if (node->iss_InstrumentTable)
InstrPopStack(&node->iss_InstrumentTable->instr);

if (unlikely(!found))
continue;

Which matches the loop in index_getnext_slot (i.e keep calling
index_fetch_heap if xs_heap_continue=true). Based on that, we wouldn't
need the elog at all.

+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
I don't see pgBufferUsage anywhere else in the current code, it was
probably removed between rebases?

That's intentional to allow testing pg_session_buffer_usage both on
master (HAVE_INSTR_STACK=0) and with the patched version
(HAVE_INSTR_STACK=1) - mainly to ensure we're matching behavior for
anyone interested in the top-line buffer or WAL numbers. I don't think
we should commit pg_session_buffer_usage (at least not as-is), its
just there for testing the earlier changes.

+ char    *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+ ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
usage->shared_blks_hit, es);
(And many similar ExplainPropery after this)

title is NULL most of the time, and this results in 16 allocations for
that common case - isn't there a better solution like using
ExplainOpenGroup or something?

Yeah, I guess that's fair - I don't know if the extra allocations
really matter, but I can see your point.

If we used ExplainOpenGroup that would make it a nested structure in
the JSON/etc representation, so it'd look like this:

"Shared Read Blocks:" 123.0,
...
"Temp I/O Write Time:" 42.0,
"Table": {
"Shared Read Blocks:" 123.0,
...
"Temp I/O Write Time:" 42.0,
}

compared to text representation (simplified):

Buffers: shared read=123.0
I/O Timings: temp write=42.0
Table Buffers: shared read=123.0
Table I/O Timings: temp write=42.0

I think that can work, and we have precedence for using groups in a
node like this, e.g. with "Workers".

If we go with this, I do wonder if we should clarify the JSON/etc
group key as "Table Access" (one group) or "Table Buffers" / "Table
I/O" (two groups), instead of just "Table"?

+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, instead call InstrPopAndFinalizeStack which can skip intermediate
+ * stack entries, or instead use InstrStart/InstrStop.

InstrPopAndFinalizeStack doesn't exists in the latest patch version

Ah, yeah, that should say InstrStopFinalize (basically direct use of
InstrPushStack/InstrPopStack is discouraged over use of the
Instrumentation functions), I'll fix it in the next revision.

Thanks,
Lukas

--
Lukas Fittl

#16

Zsolt Parragi

zsolt.parragi@percona.com

about 6 hours ago

In reply to: Lukas Fittl (#15)

Re: Stack-based tracking of per-node WAL/buffer usage

ResourceOwnerForgetInstrumentation directly follows the call to
ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
sure which error case you're thinking of?

There are a few pallocs between them, so OOM is possible, even if
unlikely. I mainly mentioned this because even if unlikely it can
happen in theory, and the fix seems simple to me.

I don't think that's a permanent leak, since it would be in the memory
context of the caller, i.e. the per-query memory context

Yes, it's definitely not permanent, but could be bad with many tuples.

and so
doing the ResOwner dance for each tuple is probably not ideal.

These approaches are interesting, but also add complexity, so I'm
unsure which is better for this, the pfree calls add one line and
solve the main issue with the current code.

Basically it should be this instead, I think,
so we correctly call the table AM's table_index_fetch_tuple again if
call_again gets set:

Right, this code will be better.

I don't know if the extra allocations
really matter, but I can see your point.

Yeah, probably doesn't matter that much, but the code also wasn't that
nice in that form. I didn't try to actually modify it, but by just
looking at it the grouped option seemed cleaner to me, and the output
should also be self-explanatory and logical to users.

Stack-based tracking of per-node WAL/buffer usage

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: