[RFC PATCH v0 0/7] Add EXPLAIN ANALYZE wait event reporting

Started by Ilmar Yunusov16 days ago11 messageshackers
Jump to latest
#1Ilmar Yunusov
tanswis42@gmail.com

This RFC prototype adds `EXPLAIN (ANALYZE, WAITS)`, which reports
completed wait intervals observed through `pgstat_report_wait_start/end()`.
The option is named `WAITS` in this RFC to match the short style of
`BUFFERS`, `WAL`, `IO`, and `MEMORY`. I am not attached to the exact name;
`WAIT_EVENTS` may be clearer but is more verbose.

PostgreSQL already exposes a backend's current wait event through
pg_stat_activity. This patch explores making the same wait event
instrumentation useful in EXPLAIN ANALYZE by collecting per-statement and
per-plan-node wait event usage while a statement executes.

Statement-level output is reported as `Statement Wait Events`. It counts
each completed wait once per active statement-level collector and includes
parallel worker waits. Nested EXPLAIN ANALYZE WAITS collectors maintain
separate statement-level summaries; a wait is counted once in each active
collector.

Plan-node output is reported as `Wait Events`. Node-level attribution is
intentionally inclusive, matching EXPLAIN ANALYZE node timing: a wait is
attributed to every active plan node captured when the wait begins. This
means parent and child nodes can show the same wait, and node-level wait
times must not be summed to compute a statement total.

The implementation keeps wait-end accounting allocation-free. Each
statement and plan-node accumulator preallocates storage for 64 distinct
wait event identities; additional distinct identities are accumulated in
`Unrecorded Wait Event Calls` and `Unrecorded Wait Event Time` without event
identity. The fixed bound is intended to make the wait-end path predictable
and safe in places where allocation would be undesirable. The overflow
bucket preserves total calls/time, but loses per-event identity. This is a
deliberate RFC point.

Patch layout:

1. add statement-level EXPLAIN WAITS reporting;
2. aggregate statement-level waits from parallel workers;
3. add plan-node wait attribution, including manual executor paths;
4. refine attribution semantics, docs, overflow output, and tests;
5. harden accumulator handling and keep wait-end allocation-free;
6. hide accumulator internals behind the wait-event accounting API;
7. update EXPLAIN option tab completion.

Important review questions:

- Is the `WAITS` option name and output shape acceptable, or should this be
`WAIT_EVENTS` / different labels?
- Is inclusive per-node attribution the right semantic for EXPLAIN?
- Is the fixed 64-entry accumulator plus explicit overflow bucket acceptable?
- Is the disabled hot-path overhead of checking an exported boolean in
pgstat_report_wait_start/end acceptable?
- Are the test scaffolding choices acceptable, especially planner GUCs and
pg_sleep wrappers used to force deterministic wait-attribution cases? The
tests use pg_sleep only to force a stable Timeout:PgSleep wait identity;
durations are normalized by the existing EXPLAIN test filters.

Local verification so far:

- `make -s -j4`
- `make -C doc/src/sgml check`
- `make -s -C src/bin/psql`
- `make -C src/test/regress check-tests TESTS='test_setup create_index explain'`
- `git diff --check`

The final diff of this 7-patch branch is identical to the development branch
`r314tive/pg-wait-explain-mvp`.

Local optimized macOS microbenchmarks are directional only. The current
synthetic C wait-loop run measured roughly 0.1-0.2 ns/wait disabled overhead
and about 30 ns/wait enabled accounting for a single active node. These
numbers are not intended as performance evidence for commit; they only served
as a local smoke check that the disabled path is plausibly small. I would
want repeated Linux, CPU-pinned numbers before drawing stronger conclusions.

Ilmar Yunusov (7):
Add EXPLAIN WAITS statement reporting
Aggregate EXPLAIN WAITS from parallel workers
Attribute EXPLAIN WAITS to plan nodes
Refine EXPLAIN WAITS attribution semantics
Harden EXPLAIN WAITS accumulator handling
Hide EXPLAIN WAITS accumulator internals
Keep EXPLAIN option completion current

doc/src/sgml/ref/explain.sgml | 61 ++++
src/backend/commands/explain.c | 172 +++++++++-
src/backend/commands/explain_state.c | 8 +
src/backend/executor/execAsync.c | 22 ++
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 295 ++++++++++++++++-
src/backend/executor/execProcnode.c | 24 +-
src/backend/executor/execUtils.c | 1 +
src/backend/executor/instrument.c | 7 +
src/backend/executor/nodeBitmapAnd.c | 7 +
src/backend/executor/nodeBitmapIndexscan.c | 7 +
src/backend/executor/nodeBitmapOr.c | 7 +
src/backend/executor/nodeHash.c | 7 +
src/backend/utils/activity/wait_event.c | 363 +++++++++++++++++++++
src/bin/psql/tab-complete.in.c | 6 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 1 +
src/include/nodes/execnodes.h | 3 +
src/include/utils/wait_event.h | 45 +++
src/test/regress/expected/explain.out | 202 ++++++++++++
src/test/regress/sql/explain.sql | 144 ++++++++
22 files changed, 1371 insertions(+), 15 deletions(-)

--
2.52.0

#2Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 1/7] Add EXPLAIN WAITS statement reporting

---
doc/src/sgml/ref/explain.sgml | 14 +++
src/backend/commands/explain.c | 129 +++++++++++++++++++-
src/backend/commands/explain_state.c | 8 ++
src/backend/utils/activity/wait_event.c | 149 ++++++++++++++++++++++++
src/include/commands/explain_state.h | 1 +
src/include/utils/wait_event.h | 29 +++++
src/test/regress/expected/explain.out | 17 +++
src/test/regress/sql/explain.sql | 5 +
8 files changed, 349 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index e95e19081e1..5460568af1d 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -47,6 +47,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
     SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
     MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
     IO [ <replaceable class="parameter">boolean</replaceable> ]
+    WAITS [ <replaceable class="parameter">boolean</replaceable> ]
     FORMAT { TEXT | XML | JSON | YAML }
 </synopsis>
  </refsynopsisdiv>
@@ -310,6 +311,19 @@ ROLLBACK;
     </listitem>
    </varlistentry>
+   <varlistentry>
+    <term><literal>WAITS</literal></term>
+    <listitem>
+     <para>
+      Include exact backend-local wait event timing observed during query
+      execution.  Wait events are grouped by wait event type and wait event
+      name, and include the number of waits and total time spent waiting.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to <literal>FALSE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>FORMAT</literal></term>
     <listitem>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 112c17b0d64..0b7cc5c15c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -45,6 +45,7 @@
 #include "utils/tuplesort.h"
 #include "utils/tuplestore.h"
 #include "utils/typcache.h"
+#include "utils/wait_event.h"
 #include "utils/xml.h"
@@ -149,6 +150,9 @@ static const char *explain_get_index_name(Oid indexId);
 static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static int	wait_event_usage_cmp(const void *a, const void *b);
+static void show_wait_event_usage(ExplainState *es,
+								  const WaitEventUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
 static void show_result_replacement_info(Result *result, ExplainState *es);
@@ -510,6 +514,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	int			eflags;
 	int			instrument_option = 0;
 	SerializeMetrics serializeMetrics = {0};
+	WaitEventUsage waitEventUsage;
+	WaitEventUsage *waitEventUsagePtr = NULL;

Assert(plannedstmt->commandType != CMD_UTILITY);

@@ -583,11 +589,27 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
else
dir = ForwardScanDirection;

+		if (es->waits)
+		{
+			waitEventUsagePtr = &waitEventUsage;
+			pgstat_begin_wait_event_usage(waitEventUsagePtr,
+										  queryDesc->estate->es_query_cxt);
+		}
+
 		/* run the plan */
-		ExecutorRun(queryDesc, dir, 0);
+		PG_TRY();
+		{
+			ExecutorRun(queryDesc, dir, 0);
-		/* run cleanup too */
-		ExecutorFinish(queryDesc);
+			/* run cleanup too */
+			ExecutorFinish(queryDesc);
+		}
+		PG_FINALLY();
+		{
+			if (waitEventUsagePtr)
+				pgstat_end_wait_event_usage(waitEventUsagePtr);
+		}
+		PG_END_TRY();

/* We can't run ExecutorEnd 'till we're done printing the stats... */
totaltime += elapsed_time(&starttime);
@@ -605,6 +627,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);

+	if (waitEventUsagePtr)
+		show_wait_event_usage(es, waitEventUsagePtr);
+
 	/* Show buffer and/or memory usage in planning */
 	if (peek_buffer_usage(es, bufusage) || mem_counters)
 	{
@@ -4503,6 +4528,104 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
 	}
 }
+static int
+wait_event_usage_cmp(const void *a, const void *b)
+{
+	const WaitEventUsageEntry *ea = (const WaitEventUsageEntry *) a;
+	const WaitEventUsageEntry *eb = (const WaitEventUsageEntry *) b;
+	int64		ta = INSTR_TIME_GET_MICROSEC(ea->time);
+	int64		tb = INSTR_TIME_GET_MICROSEC(eb->time);
+
+	if (ta < tb)
+		return 1;
+	if (ta > tb)
+		return -1;
+	if (ea->wait_event_info < eb->wait_event_info)
+		return -1;
+	if (ea->wait_event_info > eb->wait_event_info)
+		return 1;
+	return 0;
+}
+
+static void
+show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+{
+	WaitEventUsageEntry *entries;
+
+	if (usage == NULL)
+		return;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+		return;
+
+	if (usage->nentries > 0)
+	{
+		entries = palloc_array(WaitEventUsageEntry, usage->nentries);
+		memcpy(entries, usage->entries,
+			   sizeof(WaitEventUsageEntry) * usage->nentries);
+		qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+			  wait_event_usage_cmp);
+	}
+	else
+		entries = NULL;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Wait Events:\n");
+		es->indent++;
+
+		for (int i = 0; i < usage->nentries; i++)
+		{
+			const char *event_type;
+			const char *event_name;
+
+			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "%s:%s calls=%" PRIu64 " time=%0.3f ms\n",
+							 event_type ? event_type : "Unknown",
+							 event_name ? event_name : "unknown",
+							 entries[i].calls,
+							 INSTR_TIME_GET_MILLISEC(entries[i].time));
+		}
+
+		es->indent--;
+	}
+	else
+	{
+		ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
+
+		for (int i = 0; i < usage->nentries; i++)
+		{
+			const char *event_type;
+			const char *event_name;
+
+			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+			ExplainOpenGroup("Wait-Event", NULL, true, es);
+			ExplainPropertyText("Wait Event Type",
+								event_type ? event_type : "Unknown",
+								es);
+			ExplainPropertyText("Wait Event",
+								event_name ? event_name : "unknown",
+								es);
+			ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+			ExplainPropertyFloat("Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(entries[i].time),
+								 3, es);
+			ExplainCloseGroup("Wait-Event", NULL, true, es);
+		}
+
+		ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+	}
+
+	if (entries)
+		pfree(entries);
+}
+
 /*
  * Show memory usage details.
  */
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a0ee0a664be..2abcb64cc95 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -164,6 +164,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 		}
 		else if (strcmp(opt->defname, "io") == 0)
 			es->io = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "waits") == 0)
+			es->waits = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -196,6 +198,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("EXPLAIN option %s requires ANALYZE", "IO")));
+	/* check that WAITS is used with EXPLAIN ANALYZE */
+	if (es->waits && !es->analyze)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("EXPLAIN option %s requires ANALYZE", "WAITS")));
+
 	/* check that serialize is used with EXPLAIN ANALYZE */
 	if (es->serialize != EXPLAIN_SERIALIZE_NONE && !es->analyze)
 		ereport(ERROR,
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 95635c7f56c..60d37ccbb73 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,11 +36,20 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
+static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+							  const instr_time *elapsed);

static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;

+#define WAIT_EVENT_USAGE_INITIAL_EVENTS	16
+
+int			pgstat_wait_event_usage_depth = 0;
+static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static uint32 pgstat_wait_event_usage_current = 0;
+static instr_time pgstat_wait_event_usage_start;
+
 #define WAIT_EVENT_CLASS_MASK	0xFF000000
 #define WAIT_EVENT_ID_MASK		0x0000FFFF

@@ -349,6 +358,146 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}

+/*
+ * Start collecting exact wait event timings in this backend.
+ *
+ * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
+ * It records waits observed through pgstat_report_wait_start/end in backend
+ * local memory.  Nested collection is deliberately treated as part of the
+ * outer collection for now; callers that want independent nested accounting
+ * need a stack of WaitEventUsage contexts.
+ */
+void
+pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+	Assert(usage != NULL);
+	Assert(memcontext != NULL);
+
+	if (pgstat_wait_event_usage_depth++ == 0)
+	{
+		memset(usage, 0, sizeof(WaitEventUsage));
+		usage->memcontext = memcontext;
+		pgstat_wait_event_usage = usage;
+		pgstat_wait_event_usage_current = 0;
+		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+	}
+}
+
+/*
+ * Stop collecting wait event timings.
+ */
+void
+pgstat_end_wait_event_usage(WaitEventUsage *usage)
+{
+	Assert(usage != NULL);
+	Assert(pgstat_wait_event_usage_depth > 0);
+
+	if (--pgstat_wait_event_usage_depth == 0)
+	{
+		if (pgstat_wait_event_usage_current != 0)
+			pgstat_count_wait_event_end();
+
+		pgstat_wait_event_usage = NULL;
+		pgstat_wait_event_usage_current = 0;
+		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+	}
+}
+
+/*
+ * Record the beginning of a wait event for exact EXPLAIN-style accounting.
+ */
+void
+pgstat_count_wait_event_start(uint32 wait_event_info)
+{
+	if (pgstat_wait_event_usage == NULL)
+		return;
+
+	/*
+	 * Waits are not expected to nest.  If they do, finish the previous wait
+	 * at the boundary so accounting remains internally consistent.
+	 */
+	if (pgstat_wait_event_usage_current != 0)
+		pgstat_count_wait_event_end();
+
+	pgstat_wait_event_usage_current = wait_event_info;
+	INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
+}
+
+/*
+ * Record the end of the current wait event.
+ */
+void
+pgstat_count_wait_event_end(void)
+{
+	instr_time	end;
+	instr_time	elapsed;
+
+	if (pgstat_wait_event_usage == NULL ||
+		pgstat_wait_event_usage_current == 0)
+		return;
+
+	INSTR_TIME_SET_CURRENT(end);
+	elapsed = end;
+	INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
+
+	WaitEventUsageAdd(pgstat_wait_event_usage,
+					  pgstat_wait_event_usage_current,
+					  &elapsed);
+
+	pgstat_wait_event_usage_current = 0;
+	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+}
+
+static void
+WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+				  const instr_time *elapsed)
+{
+	WaitEventUsageEntry *entry = NULL;
+
+	for (int i = 0; i < usage->nentries; i++)
+	{
+		if (usage->entries[i].wait_event_info == wait_event_info)
+		{
+			entry = &usage->entries[i];
+			break;
+		}
+	}
+
+	if (entry == NULL)
+	{
+		if (usage->nentries >= usage->maxentries)
+		{
+			MemoryContext oldcontext;
+			int			newmaxentries;
+
+			if (usage->maxentries > 0)
+				newmaxentries = usage->maxentries * 2;
+			else
+				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
+
+			oldcontext = MemoryContextSwitchTo(usage->memcontext);
+			if (usage->entries)
+				usage->entries = repalloc_array(usage->entries,
+												WaitEventUsageEntry,
+												newmaxentries);
+			else
+				usage->entries = palloc_array(WaitEventUsageEntry,
+											  newmaxentries);
+			MemoryContextSwitchTo(oldcontext);
+
+			usage->maxentries = newmaxentries;
+		}
+
+		entry = &usage->entries[usage->nentries++];
+		entry->wait_event_info = wait_event_info;
+		entry->calls = 0;
+		INSTR_TIME_SET_ZERO(entry->time);
+	}
+
+	entry->calls++;
+	INSTR_TIME_ADD(entry->time, *elapsed);
+}
+
 /* ----------
  * pgstat_get_wait_event_type() -
  *
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 97bc7ed49f6..4f0d0d495d4 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
 	bool		io;				/* print info about IO (prefetch, ...) */
+	bool		waits;			/* print wait event usage */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 86ee348220d..f7fab5736bb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -11,16 +11,39 @@
 #define WAIT_EVENT_H
 /* enums for wait events */
+#include "portability/instr_time.h"
+#include "utils/palloc.h"
 #include "utils/wait_event_types.h"
+typedef struct WaitEventUsageEntry
+{
+	uint32		wait_event_info;
+	uint64		calls;
+	instr_time	time;
+} WaitEventUsageEntry;
+
+typedef struct WaitEventUsage
+{
+	MemoryContext memcontext;
+	int			nentries;
+	int			maxentries;
+	WaitEventUsageEntry *entries;
+} WaitEventUsage;
+
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
 extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
 static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
+										  MemoryContext memcontext);
+extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_count_wait_event_start(uint32 wait_event_info);
+extern void pgstat_count_wait_event_end(void);

extern PGDLLIMPORT uint32 *my_wait_event_info;
+extern PGDLLIMPORT int pgstat_wait_event_usage_depth;

 /*
@@ -66,6 +89,9 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
 static inline void
 pgstat_report_wait_start(uint32 wait_event_info)
 {
+	if (pgstat_wait_event_usage_depth > 0)
+		pgstat_count_wait_event_start(wait_event_info);
+
 	/*
 	 * Since this is a four-byte field which is always read and written as
 	 * four-bytes, updates are atomic.
@@ -82,6 +108,9 @@ pgstat_report_wait_start(uint32 wait_event_info)
 static inline void
 pgstat_report_wait_end(void)
 {
+	if (pgstat_wait_event_usage_depth > 0)
+		pgstat_count_wait_event_end();
+
 	/* see pgstat_report_wait_start() */
 	*(volatile uint32 *) my_wait_event_info = 0;
 }
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 74a4d87801e..2c7a7e1d4c6 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -99,6 +99,23 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
 (1 row)
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+            explain_filter             
+---------------------------------------
+ Result (actual rows=N.N loops=N)
+ Wait Events:
+   Timeout:PgSleep calls=N time=N.N ms
+(3 rows)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+explain (waits) select 1;
+ERROR:  EXPLAIN option WAITS requires ANALYZE
 \a
 select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
 explain_filter
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 2f163c64bf6..fe025ddeac5 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -68,6 +68,11 @@ select explain_filter('explain (analyze, buffers off, verbose) select * from int
 select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+explain (waits) select 1;
+
 \a
 select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
 select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
-- 
2.52.0
#3Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 2/7] Aggregate EXPLAIN WAITS from parallel workers

---
src/backend/commands/explain.c | 4 +
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 129 ++++++++++++++++++++++++
src/backend/executor/execUtils.c | 1 +
src/backend/utils/activity/wait_event.c | 22 +++-
src/include/executor/execParallel.h | 2 +
src/include/nodes/execnodes.h | 2 +
src/include/utils/wait_event.h | 3 +
src/test/regress/expected/explain.out | 17 ++++
src/test/regress/sql/explain.sql | 12 +++
10 files changed, 190 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0b7cc5c15c6..9d7372f5415 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -594,6 +594,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			waitEventUsagePtr = &waitEventUsage;
 			pgstat_begin_wait_event_usage(waitEventUsagePtr,
 										  queryDesc->estate->es_query_cxt);
+			queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
 		}
 		/* run the plan */
@@ -607,7 +608,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		PG_FINALLY();
 		{
 			if (waitEventUsagePtr)
+			{
 				pgstat_end_wait_event_usage(waitEventUsagePtr);
+				queryDesc->estate->es_wait_event_usage = NULL;
+			}
 		}
 		PG_END_TRY();
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b30f768680..86ab124c1c0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3066,6 +3066,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
 	/* es_trig_target_relations must NOT be copied */
 	rcestate->es_top_eflags = parentestate->es_top_eflags;
 	rcestate->es_instrument = parentestate->es_instrument;
+	rcestate->es_wait_event_usage = parentestate->es_wait_event_usage;
 	/* es_auxmodifytables must NOT be copied */
 	/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 81b87d82fab..8213565a708 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
 #include "utils/dsa.h"
 #include "utils/lsyscache.h"
 #include "utils/snapmgr.h"
+#include "utils/wait_event.h"
 /*
  * Magic numbers for parallel executor communication.  We use constants
@@ -67,6 +68,7 @@
 #define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_WAIT_EVENT_USAGE	UINT64CONST(0xE00000000000000B)

#define PARALLEL_TUPLE_QUEUE_SIZE 65536

@@ -114,6 +116,18 @@ struct SharedExecutorInstrumentation
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))

+typedef struct SharedWaitEventUsageWorker
+{
+	int			nentries;
+	dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
+struct SharedWaitEventUsage
+{
+	int			num_workers;
+	SharedWaitEventUsageWorker worker_usage[FLEXIBLE_ARRAY_MEMBER];
+};
+
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
 {
@@ -141,6 +155,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
 										ParallelContext *pcxt);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
 												SharedExecutorInstrumentation *instrumentation);
+static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+											 dsa_area *area,
+											 const WaitEventUsage *usage);
 /* Helper function that runs in the parallel worker. */
 static DestReceiver *ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc);
@@ -664,10 +682,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	char	   *paramlistinfo_space;
 	BufferUsage *bufusage_space;
 	WalUsage   *walusage_space;
+	SharedWaitEventUsage *wait_event_usage = NULL;
 	SharedExecutorInstrumentation *instrumentation = NULL;
 	SharedJitInstrumentation *jit_instrumentation = NULL;
 	int			pstmt_len;
 	int			paramlistinfo_len;
+	int			wait_event_usage_len = 0;
 	int			instrumentation_len = 0;
 	int			jit_instrumentation_len = 0;
 	int			instrument_offset = 0;
@@ -744,6 +764,16 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	/* Estimate space for per-worker wait event usage metadata. */
+	if (estate->es_wait_event_usage != NULL)
+	{
+		wait_event_usage_len =
+			offsetof(SharedWaitEventUsage, worker_usage) +
+			mul_size(sizeof(SharedWaitEventUsageWorker), pcxt->nworkers);
+		shm_toc_estimate_chunk(&pcxt->estimator, wait_event_usage_len);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
 	/* Estimate space for tuple queues. */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -839,6 +869,21 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
 	pei->wal_usage = walusage_space;
+	/* Allocate metadata for each worker's wait event usage, if requested. */
+	if (estate->es_wait_event_usage != NULL)
+	{
+		wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
+		wait_event_usage->num_workers = nworkers;
+		for (int i = 0; i < nworkers; i++)
+		{
+			wait_event_usage->worker_usage[i].nentries = 0;
+			wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
+					   wait_event_usage);
+		pei->wait_event_usage = wait_event_usage;
+	}
+
 	/* Set up the tuple queues that the workers will write into. */
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);

@@ -1213,6 +1258,68 @@ ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
}

+static void
+ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
+{
+	SharedWaitEventUsage *shared = pei->wait_event_usage;
+	WaitEventUsage *usage;
+
+	if (shared == NULL)
+		return;
+
+	usage = pei->planstate->state->es_wait_event_usage;
+	if (usage == NULL)
+		return;
+
+	for (int i = 0; i < shared->num_workers; i++)
+	{
+		SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
+		WaitEventUsageEntry *entries;
+
+		if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+			continue;
+
+		entries = dsa_get_address(pei->area, worker->entries);
+		pgstat_accumulate_wait_event_usage(usage,
+										   entries,
+										   worker->nentries);
+		dsa_free(pei->area, worker->entries);
+		worker->nentries = 0;
+		worker->entries = InvalidDsaPointer;
+	}
+}
+
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+								 dsa_area *area,
+								 const WaitEventUsage *usage)
+{
+	SharedWaitEventUsageWorker *worker;
+	WaitEventUsageEntry *entries;
+	dsa_pointer entries_dsa;
+	Size		entries_size;
+
+	Assert(shared != NULL);
+	Assert(area != NULL);
+	Assert(usage != NULL);
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < shared->num_workers);
+
+	if (usage->nentries <= 0)
+		return;
+
+	worker = &shared->worker_usage[ParallelWorkerNumber];
+	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+	entries_dsa = dsa_allocate(area, entries_size);
+	entries = dsa_get_address(area, entries_dsa);
+	memcpy(entries, usage->entries, entries_size);
+
+	if (DsaPointerIsValid(worker->entries))
+		dsa_free(area, worker->entries);
+	worker->nentries = usage->nentries;
+	worker->entries = entries_dsa;
+}
+
 /*
  * Finish parallel execution.  We wait for parallel workers to finish, and
  * accumulate their buffer/WAL usage.
@@ -1261,6 +1368,9 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	for (i = 0; i < nworkers; i++)
 		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+	/* Accumulate wait event usage, if requested. */
+	ExecParallelRetrieveWaitEventUsage(pei);
+
 	pei->finished = true;
 }
@@ -1516,10 +1626,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	SharedWaitEventUsage *wait_event_usage;
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
 	SharedJitInstrumentation *jit_instrumentation;
+	WaitEventUsage waitEventUsage;
+	WaitEventUsage *waitEventUsagePtr = NULL;
 	int			instrument_options = 0;
 	void	   *area_space;
 	dsa_area   *area;
@@ -1535,6 +1648,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		instrument_options = instrumentation->instrument_options;
 	jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
 										 true);
+	wait_event_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAIT_EVENT_USAGE, true);
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);

/* Setting debug_query_string for individual workers */
@@ -1576,6 +1690,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
*/
InstrStartParallelQuery();

+	if (wait_event_usage != NULL)
+	{
+		waitEventUsagePtr = &waitEventUsage;
+		pgstat_begin_wait_event_usage(waitEventUsagePtr,
+									  queryDesc->estate->es_query_cxt);
+	}
+
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
 	 * more tuples than that.
@@ -1587,6 +1708,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Shut down the executor */
 	ExecutorFinish(queryDesc);
+	if (waitEventUsagePtr != NULL)
+	{
+		pgstat_end_wait_event_usage(waitEventUsagePtr);
+		ExecParallelReportWaitEventUsage(wait_event_usage,
+										 area,
+										 waitEventUsagePtr);
+	}
+
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..80ea777632b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)

estate->es_top_eflags = 0;
estate->es_instrument = 0;
+ estate->es_wait_event_usage = NULL;
estate->es_finished = false;

 	estate->es_exprcontexts = NIL;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60d37ccbb73..eb01bc3d88c 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -37,7 +37,7 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
-							  const instr_time *elapsed);
+							  uint64 calls, const instr_time *elapsed);

static uint32 local_my_wait_event_info;
@@ -442,15 +442,31 @@ pgstat_count_wait_event_end(void)

WaitEventUsageAdd(pgstat_wait_event_usage,
pgstat_wait_event_usage_current,
+ 1,
&elapsed);

pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}

+void
+pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+								   const WaitEventUsageEntry *entries,
+								   int nentries)
+{
+	Assert(usage != NULL);
+	Assert(nentries == 0 || entries != NULL);
+
+	for (int i = 0; i < nentries; i++)
+		WaitEventUsageAdd(usage,
+						  entries[i].wait_event_info,
+						  entries[i].calls,
+						  &entries[i].time);
+}
+
 static void
 WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
-				  const instr_time *elapsed)
+				  uint64 calls, const instr_time *elapsed)
 {
 	WaitEventUsageEntry *entry = NULL;

@@ -494,7 +510,7 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
INSTR_TIME_SET_ZERO(entry->time);
}

-	entry->calls++;
+	entry->calls += calls;
 	INSTR_TIME_ADD(entry->time, *elapsed);
 }
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..71df2c2511c 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -20,6 +20,7 @@
 #include "utils/dsa.h"

typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+typedef struct SharedWaitEventUsage SharedWaitEventUsage;

 typedef struct ParallelExecutorInfo
 {
@@ -27,6 +28,7 @@ typedef struct ParallelExecutorInfo
 	ParallelContext *pcxt;		/* parallel context we're using */
 	BufferUsage *buffer_usage;	/* points to bufusage area in DSM */
 	WalUsage   *wal_usage;		/* walusage area in DSM */
+	SharedWaitEventUsage *wait_event_usage;	/* optional */
 	SharedExecutorInstrumentation *instrumentation; /* optional */
 	struct SharedJitInstrumentation *jit_instrumentation;	/* optional */
 	dsa_area   *area;			/* points to DSA area in DSM */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 13359180d25..781c8c79132 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -69,6 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
 typedef struct TupleTableSlot TupleTableSlot;
 typedef struct TupleTableSlotOps TupleTableSlotOps;
 typedef struct WalUsage WalUsage;
+typedef struct WaitEventUsage WaitEventUsage;
 typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;

@@ -754,6 +755,7 @@ typedef struct EState

int es_top_eflags; /* eflags passed to ExecutorStart */
int es_instrument; /* OR of InstrumentOption flags */
+ WaitEventUsage *es_wait_event_usage; /* EXPLAIN WAITS accumulator */
bool es_finished; /* true when ExecutorFinish is done */

 	List	   *es_exprcontexts;	/* List of ExprContexts within EState */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f7fab5736bb..63992137ee7 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -39,6 +39,9 @@ extern void pgstat_reset_wait_event_storage(void);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+											   const WaitEventUsageEntry *entries,
+											   int nentries);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 2c7a7e1d4c6..e3847e222be 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -114,6 +114,23 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+  language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+  '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
 \a
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index fe025ddeac5..8821250bcef 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,18 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+  language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+  '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
 explain (waits) select 1;

\a
--
2.52.0

#4Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 3/7] Attribute EXPLAIN WAITS to plan nodes

---
src/backend/commands/explain.c | 6 +-
src/backend/executor/execAsync.c | 22 ++++
src/backend/executor/execParallel.c | 131 ++++++++++++++++++---
src/backend/executor/execProcnode.c | 31 ++++-
src/backend/executor/instrument.c | 7 ++
src/backend/executor/nodeBitmapAnd.c | 7 ++
src/backend/executor/nodeBitmapIndexscan.c | 7 ++
src/backend/executor/nodeBitmapOr.c | 7 ++
src/backend/executor/nodeHash.c | 7 ++
src/backend/utils/activity/wait_event.c | 49 +++++++-
src/include/executor/instrument.h | 1 +
src/include/nodes/execnodes.h | 1 +
src/include/utils/wait_event.h | 4 +
src/test/regress/expected/explain.out | 52 +++++++-
src/test/regress/sql/explain.sql | 29 +++++
15 files changed, 333 insertions(+), 28 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9d7372f5415..0575ae8368b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -530,6 +530,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_WAL;
 	if (es->io)
 		instrument_option |= INSTRUMENT_IO;
+	if (es->waits)
+		instrument_option |= INSTRUMENT_WAITS;
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -2332,6 +2334,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		show_buffer_usage(es, &planstate->instrument->instr.bufusage);
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
+	if (es->waits)
+		show_wait_event_usage(es, planstate->wait_event_usage);

/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4559,7 +4563,7 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
if (usage == NULL)
return;

-	if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+	if (usage->nentries == 0)
 		return;
 	if (usage->nentries > 0)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index cf7ddbb01f4..e73ea55cf92 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -19,6 +19,7 @@
 #include "executor/instrument.h"
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
+#include "utils/wait_event.h"
 /*
  * Asynchronously request a tuple from a designed async-capable node.
@@ -26,12 +27,17 @@
 void
 ExecAsyncRequest(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	if (areq->requestee->chgParam != NULL)	/* something changed? */
 		ExecReScan(areq->requestee);	/* let ReScan handle this */
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);

switch (nodeTag(areq->requestee))
{
@@ -47,6 +53,8 @@ ExecAsyncRequest(AsyncRequest *areq)
ExecAsyncResponse(areq);

 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument,
 					  TupIsNull(areq->result) ? 0.0 : 1.0);
@@ -62,9 +70,14 @@ ExecAsyncRequest(AsyncRequest *areq)
 void
 ExecAsyncConfigureWait(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);

switch (nodeTag(areq->requestee))
{
@@ -78,6 +91,8 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
}

 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument, 0.0);
 }
@@ -88,9 +103,14 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
 void
 ExecAsyncNotify(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);

switch (nodeTag(areq->requestee))
{
@@ -106,6 +126,8 @@ ExecAsyncNotify(AsyncRequest *areq)
ExecAsyncResponse(areq);

 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument,
 					  TupIsNull(areq->result) ? 0.0 : 1.0);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8213565a708..f77f539dd27 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -83,6 +83,12 @@ typedef struct FixedParallelExecutorState
 	int			jit_flags;
 } FixedParallelExecutorState;
+typedef struct SharedWaitEventUsageWorker
+{
+	int			nentries;
+	dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
 /*
  * DSM structure for accumulating per-PlanState instrumentation.
  *
@@ -92,6 +98,10 @@ typedef struct FixedParallelExecutorState
  * of the first NodeInstrumentation object.  This will depend on the length of
  * the plan_node_id array.
  *
+ * wait_event_usage_offset: Offset, relative to the start of this structure,
+ * of the first SharedWaitEventUsageWorker object, or 0 if wait event usage is
+ * not being collected.
+ *
  * num_workers: Number of workers.
  *
  * num_plan_nodes: Number of plan nodes.
@@ -103,6 +113,7 @@ struct SharedExecutorInstrumentation
 {
 	int			instrument_options;
 	int			instrument_offset;
+	int			wait_event_usage_offset;
 	int			num_workers;
 	int			num_plan_nodes;
 	int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
@@ -110,17 +121,17 @@ struct SharedExecutorInstrumentation
 	/*
 	 * Array of num_plan_nodes * num_workers NodeInstrumentation objects
 	 * follows.
+	 *
+	 * If wait_event_usage_offset is non-zero, an array of num_plan_nodes *
+	 * num_workers SharedWaitEventUsageWorker objects follows.
 	 */
 };
 #define GetInstrumentationArray(sei) \
 	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
 	 (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
-
-typedef struct SharedWaitEventUsageWorker
-{
-	int			nentries;
-	dsa_pointer entries;
-} SharedWaitEventUsageWorker;
+#define GetInstrumentationWaitEventUsageArray(sei) \
+	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+	 (SharedWaitEventUsageWorker *) (((char *) sei) + sei->wait_event_usage_offset))

struct SharedWaitEventUsage
{
@@ -143,6 +154,12 @@ typedef struct ExecParallelInitializeDSMContext
int nnodes;
} ExecParallelInitializeDSMContext;

+typedef struct ExecParallelRetrieveInstrumentationContext
+{
+	SharedExecutorInstrumentation *instrumentation;
+	dsa_area   *area;
+} ExecParallelRetrieveInstrumentationContext;
+
 /* Helper functions that run in the parallel leader. */
 static char *ExecSerializePlan(Plan *plan, EState *estate);
 static bool ExecParallelEstimate(PlanState *planstate,
@@ -154,8 +171,11 @@ static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt,
 static bool ExecParallelReInitializeDSM(PlanState *planstate,
 										ParallelContext *pcxt);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
-												SharedExecutorInstrumentation *instrumentation);
+												ExecParallelRetrieveInstrumentationContext *r);
 static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+												   dsa_area *area,
+												   const WaitEventUsage *usage);
 static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
 											 dsa_area *area,
 											 const WaitEventUsage *usage);
@@ -691,6 +711,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	int			instrumentation_len = 0;
 	int			jit_instrumentation_len = 0;
 	int			instrument_offset = 0;
+	int			wait_event_usage_offset = 0;
 	Size		dsa_minsize = dsa_minimum_size();
 	char	   *query_string;
 	int			query_len;
@@ -798,6 +819,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation_len +=
 			mul_size(sizeof(NodeInstrumentation),
 					 mul_size(e.nnodes, nworkers));
+		if (estate->es_instrument & INSTRUMENT_WAITS)
+		{
+			instrumentation_len = MAXALIGN(instrumentation_len);
+			wait_event_usage_offset = instrumentation_len;
+			instrumentation_len +=
+				mul_size(sizeof(SharedWaitEventUsageWorker),
+						 mul_size(e.nnodes, nworkers));
+		}
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -903,11 +932,23 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
 		instrumentation->instrument_options = estate->es_instrument;
 		instrumentation->instrument_offset = instrument_offset;
+		instrumentation->wait_event_usage_offset = wait_event_usage_offset;
 		instrumentation->num_workers = nworkers;
 		instrumentation->num_plan_nodes = e.nnodes;
 		instrument = GetInstrumentationArray(instrumentation);
 		for (i = 0; i < nworkers * e.nnodes; ++i)
 			InstrInitNode(&instrument[i], estate->es_instrument, false);
+		if (wait_event_usage_offset != 0)
+		{
+			SharedWaitEventUsageWorker *worker_usage;
+
+			worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+			for (i = 0; i < nworkers * e.nnodes; ++i)
+			{
+				worker_usage[i].nentries = 0;
+				worker_usage[i].entries = InvalidDsaPointer;
+			}
+		}
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -1137,9 +1178,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
  */
 static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
-									SharedExecutorInstrumentation *instrumentation)
+									ExecParallelRetrieveInstrumentationContext *r)
 {
+	SharedExecutorInstrumentation *instrumentation = r->instrumentation;
 	NodeInstrumentation *instrument;
+	SharedWaitEventUsageWorker *wait_event_usage = NULL;
 	int			i;
 	int			n;
 	int			ibytes;
@@ -1159,6 +1202,30 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	for (n = 0; n < instrumentation->num_workers; ++n)
 		InstrAggNode(planstate->instrument, &instrument[n]);
+	/* Accumulate the wait event usage from all workers. */
+	if (instrumentation->wait_event_usage_offset != 0 &&
+		planstate->wait_event_usage != NULL)
+	{
+		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+		wait_event_usage += i * instrumentation->num_workers;
+		for (n = 0; n < instrumentation->num_workers; ++n)
+		{
+			SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
+			WaitEventUsageEntry *entries;
+
+			if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+				continue;
+
+			entries = dsa_get_address(r->area, worker->entries);
+			pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
+											   entries,
+											   worker->nentries);
+			dsa_free(r->area, worker->entries);
+			worker->nentries = 0;
+			worker->entries = InvalidDsaPointer;
+		}
+	}
+
 	/*
 	 * Also store the per-worker detail.
 	 *
@@ -1216,7 +1283,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	}
 	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
-								 instrumentation);
+								 r);
 }

/*
@@ -1290,25 +1357,21 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
}

 static void
-ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
-								 dsa_area *area,
-								 const WaitEventUsage *usage)
+ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+									   dsa_area *area,
+									   const WaitEventUsage *usage)
 {
-	SharedWaitEventUsageWorker *worker;
 	WaitEventUsageEntry *entries;
 	dsa_pointer entries_dsa;
 	Size		entries_size;
-	Assert(shared != NULL);
+	Assert(worker != NULL);
 	Assert(area != NULL);
 	Assert(usage != NULL);
-	Assert(IsParallelWorker());
-	Assert(ParallelWorkerNumber < shared->num_workers);

if (usage->nentries <= 0)
return;

-	worker = &shared->worker_usage[ParallelWorkerNumber];
 	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
@@ -1320,6 +1383,20 @@ ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
 	worker->entries = entries_dsa;
 }
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+								 dsa_area *area,
+								 const WaitEventUsage *usage)
+{
+	Assert(shared != NULL);
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < shared->num_workers);
+
+	ExecParallelReportWaitEventUsageWorker(&shared->worker_usage[ParallelWorkerNumber],
+										   area,
+										   usage);
+}
+
 /*
  * Finish parallel execution.  We wait for parallel workers to finish, and
  * accumulate their buffer/WAL usage.
@@ -1385,8 +1462,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
 {
 	/* Accumulate instrumentation, if any. */
 	if (pei->instrumentation)
-		ExecParallelRetrieveInstrumentation(pei->planstate,
-											pei->instrumentation);
+	{
+		ExecParallelRetrieveInstrumentationContext r;
+
+		r.instrumentation = pei->instrumentation;
+		r.area = pei->area;
+		ExecParallelRetrieveInstrumentation(pei->planstate, &r);
+	}
 	/* Accumulate JIT instrumentation, if any. */
 	if (pei->jit_instrumentation)
@@ -1495,6 +1577,17 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 	Assert(IsParallelWorker());
 	Assert(ParallelWorkerNumber < instrumentation->num_workers);
 	InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
+	if (instrumentation->wait_event_usage_offset != 0 &&
+		planstate->wait_event_usage != NULL)
+	{
+		SharedWaitEventUsageWorker *wait_event_usage;
+
+		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+		wait_event_usage += i * instrumentation->num_workers;
+		ExecParallelReportWaitEventUsageWorker(&wait_event_usage[ParallelWorkerNumber],
+											   planstate->state->es_query_dsa,
+											   planstate->wait_event_usage);
+	}
 	return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
 								 instrumentation);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7c4c66e323f..081855b3fed 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
 #include "miscadmin.h"
+#include "utils/wait_event.h"
 #include "nodes/nodeFuncs.h"
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
@@ -415,6 +416,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	if (estate->es_instrument)
 		result->instrument = InstrAllocNode(estate->es_instrument,
 											result->async_capable);
+	if (estate->es_instrument & INSTRUMENT_WAITS)
+	{
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+		result->wait_event_usage = palloc_object(WaitEventUsage);
+		pgstat_init_wait_event_usage(result->wait_event_usage,
+									 estate->es_query_cxt);
+		MemoryContextSwitchTo(oldcontext);
+	}
 	return result;
 }
@@ -481,7 +492,8 @@ ExecProcNodeFirst(PlanState *node)
  * This has essentially the same responsibilities as ExecProcNode,
  * but it does not do InstrStartNode/InstrStopNode (mainly because
  * it can't tell how many returned tuples to count).  Each per-node
- * function must provide its own instrumentation support.
+ * function must provide its own instrumentation support, including
+ * wait event attribution when enabled.
  * ----------------------------------------------------------------
  */
 Node *
@@ -758,6 +770,9 @@ ExecShutdownNode(PlanState *node)
 static bool
 ExecShutdownNode_walker(PlanState *node, void *context)
 {
+	bool		node_running;
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	if (node == NULL)
 		return false;
@@ -773,8 +788,14 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	 * has never been executed, so as to avoid incorrectly making it appear
 	 * that it has.
 	 */
-	if (node->instrument && node->instrument->running)
+	node_running = node->instrument && node->instrument->running;
+	if (node_running)
+	{
 		InstrStartNode(node->instrument);
+		if (node->wait_event_usage)
+			previous_wait_event_usage =
+				pgstat_enter_wait_event_usage(node->wait_event_usage);
+	}

planstate_tree_walker(node, ExecShutdownNode_walker, context);

@@ -803,8 +824,12 @@ ExecShutdownNode_walker(PlanState *node, void *context)
}

 	/* Stop the node if we started it above, reporting 0 tuples. */
-	if (node->instrument && node->instrument->running)
+	if (node_running)
+	{
+		if (node->wait_event_usage)
+			pgstat_restore_wait_event_usage(previous_wait_event_usage);
 		InstrStopNode(node->instrument, 0);
+	}
 	return false;
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ffbcd572133..a1f22180f82 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "portability/instr_time.h"
 #include "utils/guc_hooks.h"
+#include "utils/wait_event.h"
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
@@ -181,11 +182,17 @@ TupleTableSlot *
 ExecProcNodeInstr(PlanState *node)
 {
 	TupleTableSlot *result;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 	InstrStartNode(node->instrument);
+	if (node->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->wait_event_usage);

result = node->ExecProcNodeReal(node);

+	if (node->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 	return result;
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 9007dda3802..e2b442bd378 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -32,6 +32,7 @@
 #include "executor/instrument.h"
 #include "executor/nodeBitmapAnd.h"
 #include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
 /* ----------------------------------------------------------------
@@ -115,10 +116,14 @@ MultiExecBitmapAnd(BitmapAndState *node)
 	int			nplans;
 	int			i;
 	TIDBitmap  *result = NULL;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);

/*
* get information from the node
@@ -162,6 +167,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
elog(ERROR, "BitmapAnd doesn't support zero inputs");

 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1bc..c2798a81e97 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -28,6 +28,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
 /* ----------------------------------------------------------------
@@ -54,10 +55,14 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	IndexScanDesc scandesc;
 	double		nTuples = 0;
 	bool		doscan;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 	/* must provide our own instrumentation support */
 	if (node->ss.ps.instrument)
 		InstrStartNode(node->ss.ps.instrument);
+	if (node->ss.ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ss.ps.wait_event_usage);

/*
* extract necessary information from index scan node
@@ -116,6 +121,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
}

 	/* must provide our own instrumentation support */
+	if (node->ss.ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ss.ps.instrument)
 		InstrStopNode(node->ss.ps.instrument, nTuples);
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 148c80fdae6..e9ee934a6f6 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -33,6 +33,7 @@
 #include "executor/nodeBitmapOr.h"
 #include "nodes/tidbitmap.h"
 #include "miscadmin.h"
+#include "utils/wait_event.h"
 /* ----------------------------------------------------------------
@@ -116,10 +117,14 @@ MultiExecBitmapOr(BitmapOrState *node)
 	int			nplans;
 	int			i;
 	TIDBitmap  *result = NULL;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);

/*
* get information from the node
@@ -180,6 +185,8 @@ MultiExecBitmapOr(BitmapOrState *node)
elog(ERROR, "BitmapOr doesn't support zero inputs");

 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8825bb6fa23..6e9e4b4218f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -105,9 +105,14 @@ ExecHash(PlanState *pstate)
 Node *
 MultiExecHash(HashState *node)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);

if (node->parallel_state != NULL)
MultiExecParallelHash(node);
@@ -115,6 +120,8 @@ MultiExecHash(HashState *node)
MultiExecPrivateHash(node);

 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, node->hashtable->reportTuples);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index eb01bc3d88c..3d5a10b44c4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/subsystems.h"
 #include "storage/spin.h"
+#include "utils/memutils.h"
 #include "utils/wait_event.h"

@@ -47,6 +48,7 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;

int pgstat_wait_event_usage_depth = 0;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
static uint32 pgstat_wait_event_usage_current = 0;
static instr_time pgstat_wait_event_usage_start;

@@ -358,6 +360,28 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}

+/*
+ * Initialize a wait event usage accumulator.
+ */
+void
+pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+	Assert(usage != NULL);
+	Assert(memcontext != NULL);
+
+	memset(usage, 0, sizeof(WaitEventUsage));
+
+	/*
+	 * Wait events may end inside critical sections, for example while
+	 * performing synchronous I/O.  Keep usage entries in a dedicated context
+	 * where the memory manager permits that accounting path to grow.
+	 */
+	usage->memcontext = AllocSetContextCreate(memcontext,
+											  "Wait Event Usage",
+											  ALLOCSET_SMALL_SIZES);
+	MemoryContextAllowInCriticalSection(usage->memcontext, true);
+}
+
 /*
  * Start collecting exact wait event timings in this backend.
  *
@@ -375,9 +399,9 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 	if (pgstat_wait_event_usage_depth++ == 0)
 	{
-		memset(usage, 0, sizeof(WaitEventUsage));
-		usage->memcontext = memcontext;
+		pgstat_init_wait_event_usage(usage, memcontext);
 		pgstat_wait_event_usage = usage;
+		pgstat_wait_event_node_usage = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
@@ -398,11 +422,27 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
 			pgstat_count_wait_event_end();

pgstat_wait_event_usage = NULL;
+ pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
}

+WaitEventUsage *
+pgstat_enter_wait_event_usage(WaitEventUsage *usage)
+{
+	WaitEventUsage *previous = pgstat_wait_event_node_usage;
+
+	pgstat_wait_event_node_usage = usage;
+	return previous;
+}
+
+void
+pgstat_restore_wait_event_usage(WaitEventUsage *usage)
+{
+	pgstat_wait_event_node_usage = usage;
+}
+
 /*
  * Record the beginning of a wait event for exact EXPLAIN-style accounting.
  */
@@ -444,6 +484,11 @@ pgstat_count_wait_event_end(void)
 					  pgstat_wait_event_usage_current,
 					  1,
 					  &elapsed);
+	if (pgstat_wait_event_node_usage != NULL)
+		WaitEventUsageAdd(pgstat_wait_event_node_usage,
+						  pgstat_wait_event_usage_current,
+						  1,
+						  &elapsed);
 	pgstat_wait_event_usage_current = 0;
 	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f093a52aae0..f46ee03d2c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -65,6 +65,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
 	INSTRUMENT_IO = 1 << 4,		/* needs IO usage */
+	INSTRUMENT_WAITS = 1 << 5,	/* needs wait event usage */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 781c8c79132..56f9dcbef88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1212,6 +1212,7 @@ typedef struct PlanState

NodeInstrumentation *instrument; /* Optional runtime stats for this
* node */
+ WaitEventUsage *wait_event_usage; /* Optional wait event usage */
WorkerNodeInstrumentation *worker_instrument; /* per-worker
* instrumentation */

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 63992137ee7..64411288403 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -36,12 +36,16 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
+										 MemoryContext memcontext);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 											   const WaitEventUsageEntry *entries,
 											   int nentries);
+extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index e3847e222be..c086fa48d80 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -101,12 +101,14 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-            explain_filter             
----------------------------------------
+             explain_filter              
+-----------------------------------------
  Result (actual rows=N.N loops=N)
+   Wait Events:
+     Timeout:PgSleep calls=N time=N.N ms
  Wait Events:
    Timeout:PgSleep calls=N time=N.N ms
-(3 rows)
+(5 rows)

select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
?column?
@@ -114,6 +116,12 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)

+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -130,6 +138,44 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Plan}',
+  '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+  language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 8821250bcef..bd196f248db 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,7 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -82,6 +83,34 @@ select jsonb_path_query_first(
                          from tenk1 where unique1 = 1') #> '{0,Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Plan}',
+  '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+  language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
 rollback;
 explain (waits) select 1;

--
2.52.0

#5Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 4/7] Refine EXPLAIN WAITS attribution semantics

---
doc/src/sgml/ref/explain.sgml | 52 ++++++-
src/backend/commands/explain.c | 82 +++++++----
src/backend/executor/execParallel.c | 121 +++++++++++-----
src/backend/utils/activity/wait_event.c | 184 +++++++++++++++++++-----
src/bin/psql/tab-complete.in.c | 4 +-
src/include/utils/wait_event.h | 5 +
src/test/regress/expected/explain.out | 128 ++++++++++++++++-
src/test/regress/sql/explain.sql | 102 ++++++++++++-
8 files changed, 574 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5460568af1d..d699b215120 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -315,9 +315,46 @@ ROLLBACK;
     <term><literal>WAITS</literal></term>
     <listitem>
      <para>
-      Include exact backend-local wait event timing observed during query
-      execution.  Wait events are grouped by wait event type and wait event
-      name, and include the number of waits and total time spent waiting.
+      Include wait event timing observed during query execution.  Wait
+      events are grouped by wait event type and wait event name, using the
+      same wait event identifiers reported by
+      <structname>pg_stat_activity</structname>; see
+      <xref linkend="wait-event-table"/>.  Each entry includes the number
+      of completed waits and the total time spent waiting.  Wait timing is
+      collected when this option is enabled even if <literal>TIMING</literal>
+      is disabled.
+     </para>
+
+     <para>
+      If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
+      event storage without risking an error while a wait is ending, waits
+      whose exact event identifier could not be stored are accumulated in an
+      <literal>Unrecorded Wait Event Calls</literal> counter and
+      <literal>Unrecorded Wait Event Time</literal> total.  This is a
+      reporting fallback under memory pressure, not a wait event emitted by
+      server instrumentation.
+     </para>
+
+     <para>
+      The top-level <literal>Statement Wait Events</literal> summary counts
+      each wait once for the statement, including waits reported by parallel
+      workers.  Nested <command>EXPLAIN (ANALYZE, WAITS)</command> commands
+      maintain separate statement-level summaries; a wait is counted once in
+      each active summary.  Plan nodes can also show <literal>Wait
+      Events</literal>.  Node-level wait events are inclusive, like plan node
+      timing in
+      <command>EXPLAIN ANALYZE</command>: a wait is attributed to every
+      active plan node when the wait begins.  Therefore node-level wait event
+      times should not be summed to compute the statement total.
+     </para>
+
+     <para>
+      The overhead of this option depends on the number of waits observed and,
+      for node-level output, on the number of active plan nodes when each wait
+      begins.
+     </para>
+
+     <para>
       This parameter may only be used when <literal>ANALYZE</literal> is also
       enabled.  It defaults to <literal>FALSE</literal>.
      </para>
@@ -405,6 +442,15 @@ ROLLBACK;
    execution, and on machines that have relatively slow operating
    system calls for obtaining the time of day.
   </para>
+
+  <para>
+   The <literal>WAITS</literal> option also adds profiling overhead when
+   <literal>ANALYZE</literal> is used.  It measures wait intervals reported
+   by server instrumentation and maintains per-statement and per-node wait
+   event counters while the statement executes.  The overhead depends on the
+   number of reported waits, the number of distinct wait events, and the depth
+   of the active executor plan node stack.
+  </para>
  </refsect1>
  <refsect1>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0575ae8368b..9c198f8e599 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,7 +151,7 @@ static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static int	wait_event_usage_cmp(const void *a, const void *b);
-static void show_wait_event_usage(ExplainState *es,
+static void show_wait_event_usage(ExplainState *es, const char *labelname,
 								  const WaitEventUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
@@ -219,7 +219,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		 * In the case of an INSTEAD NOTHING, tell at least that.  But in
 		 * non-text format, the output is delimited, so this isn't necessary.
 		 */
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 			appendStringInfoString(es->str, "Query rewrites to nothing\n");
 	}
 	else
@@ -634,7 +634,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	ExplainPrintPlan(es, queryDesc);
 	if (waitEventUsagePtr)
-		show_wait_event_usage(es, waitEventUsagePtr);
+		show_wait_event_usage(es, "Statement Wait Events", waitEventUsagePtr);
 	/* Show buffer and/or memory usage in planning */
 	if (peek_buffer_usage(es, bufusage) || mem_counters)
@@ -2335,7 +2335,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
 	if (es->waits)
-		show_wait_event_usage(es, planstate->wait_event_usage);
+		show_wait_event_usage(es, "Wait Events", planstate->wait_event_usage);

/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4556,14 +4556,15 @@ wait_event_usage_cmp(const void *a, const void *b)
}

 static void
-show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+show_wait_event_usage(ExplainState *es, const char *labelname,
+					  const WaitEventUsage *usage)
 {
 	WaitEventUsageEntry *entries;

if (usage == NULL)
return;

-	if (usage->nentries == 0)
+	if (usage->nentries == 0 && usage->overflowed_calls == 0)
 		return;

if (usage->nentries > 0)
@@ -4577,10 +4578,10 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
else
entries = NULL;

-	if (es->format == EXPLAIN_FORMAT_TEXT)
+		if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainIndentText(es);
-		appendStringInfoString(es->str, "Wait Events:\n");
+		appendStringInfo(es->str, "%s:\n", labelname);
 		es->indent++;

for (int i = 0; i < usage->nentries; i++)
@@ -4599,35 +4600,60 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
INSTR_TIME_GET_MILLISEC(entries[i].time));
}

+		if (usage->overflowed_calls > 0)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str,
+							 "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
+							 usage->overflowed_calls,
+							 INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+		}
+
 		es->indent--;
 	}
 	else
 	{
-		ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
-
-		for (int i = 0; i < usage->nentries; i++)
+		if (usage->nentries > 0)
 		{
-			const char *event_type;
-			const char *event_name;
+			ExplainOpenGroup("Wait-Events", labelname, false, es);
-			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
-			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+			for (int i = 0; i < usage->nentries; i++)
+			{
+				const char *event_type;
+				const char *event_name;
-			ExplainOpenGroup("Wait-Event", NULL, true, es);
-			ExplainPropertyText("Wait Event Type",
-								event_type ? event_type : "Unknown",
-								es);
-			ExplainPropertyText("Wait Event",
-								event_name ? event_name : "unknown",
-								es);
-			ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
-			ExplainPropertyFloat("Time", "ms",
-								 INSTR_TIME_GET_MILLISEC(entries[i].time),
-								 3, es);
-			ExplainCloseGroup("Wait-Event", NULL, true, es);
+				event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+				event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+				ExplainOpenGroup("Wait-Event", NULL, true, es);
+				ExplainPropertyText("Wait Event Type",
+									event_type ? event_type : "Unknown",
+									es);
+				ExplainPropertyText("Wait Event",
+									event_name ? event_name : "unknown",
+									es);
+				ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+				ExplainPropertyFloat("Time", "ms",
+									 INSTR_TIME_GET_MILLISEC(entries[i].time),
+									 3, es);
+				ExplainCloseGroup("Wait-Event", NULL, true, es);
+			}
+
+			ExplainCloseGroup("Wait-Events", labelname, false, es);
 		}
-		ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+		if (usage->overflowed_calls > 0)
+		{
+			/*
+			 * This is not a wait event identity, so keep it outside the
+			 * Wait Events array in structured output.
+			 */
+			ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
+									usage->overflowed_calls, es);
+			ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+								 3, es);
+		}
 	}
 	if (entries)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f77f539dd27..520b4b8484f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,6 +87,8 @@ typedef struct SharedWaitEventUsageWorker
 {
 	int			nentries;
 	dsa_pointer entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 } SharedWaitEventUsageWorker;
 /*
@@ -173,6 +175,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
 												ExecParallelRetrieveInstrumentationContext *r);
 static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker);
+static void ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+													   SharedWaitEventUsageWorker *worker,
+													   dsa_area *area);
 static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 												   dsa_area *area,
 												   const WaitEventUsage *usage);
@@ -904,10 +910,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
 		wait_event_usage->num_workers = nworkers;
 		for (int i = 0; i < nworkers; i++)
-		{
-			wait_event_usage->worker_usage[i].nentries = 0;
-			wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
-		}
+			ExecParallelInitWaitEventUsageWorker(&wait_event_usage->worker_usage[i]);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
 					   wait_event_usage);
 		pei->wait_event_usage = wait_event_usage;
@@ -944,10 +947,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 			worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
 			for (i = 0; i < nworkers * e.nnodes; ++i)
-			{
-				worker_usage[i].nentries = 0;
-				worker_usage[i].entries = InvalidDsaPointer;
-			}
+				ExecParallelInitWaitEventUsageWorker(&worker_usage[i]);
 		}
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
@@ -1209,21 +1209,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
 		wait_event_usage += i * instrumentation->num_workers;
 		for (n = 0; n < instrumentation->num_workers; ++n)
-		{
-			SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
-			WaitEventUsageEntry *entries;
-
-			if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
-				continue;
-
-			entries = dsa_get_address(r->area, worker->entries);
-			pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
-											   entries,
-											   worker->nentries);
-			dsa_free(r->area, worker->entries);
-			worker->nentries = 0;
-			worker->entries = InvalidDsaPointer;
-		}
+			ExecParallelAccumulateWaitEventUsageWorker(planstate->wait_event_usage,
+													   &wait_event_usage[n],
+													   r->area);
 	}

/*
@@ -1339,18 +1327,46 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
return;

 	for (int i = 0; i < shared->num_workers; i++)
+		ExecParallelAccumulateWaitEventUsageWorker(usage,
+												   &shared->worker_usage[i],
+												   pei->area);
+}
+
+static void
+ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker)
+{
+	worker->nentries = 0;
+	worker->entries = InvalidDsaPointer;
+	worker->overflowed_calls = 0;
+	INSTR_TIME_SET_ZERO(worker->overflowed_time);
+}
+
+static void
+ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+										   SharedWaitEventUsageWorker *worker,
+										   dsa_area *area)
+{
+	Assert(usage != NULL);
+	Assert(worker != NULL);
+	Assert(area != NULL);
+
+	if (worker->overflowed_calls > 0)
 	{
-		SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
-		WaitEventUsageEntry *entries;
+		usage->overflowed_calls += worker->overflowed_calls;
+		INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+		worker->overflowed_calls = 0;
+		INSTR_TIME_SET_ZERO(worker->overflowed_time);
+	}
-		if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
-			continue;
+	if (worker->nentries > 0 && DsaPointerIsValid(worker->entries))
+	{
+		WaitEventUsageEntry *entries;
-		entries = dsa_get_address(pei->area, worker->entries);
+		entries = dsa_get_address(area, worker->entries);
 		pgstat_accumulate_wait_event_usage(usage,
 										   entries,
 										   worker->nentries);
-		dsa_free(pei->area, worker->entries);
+		dsa_free(area, worker->entries);
 		worker->nentries = 0;
 		worker->entries = InvalidDsaPointer;
 	}
@@ -1362,24 +1378,65 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 									   const WaitEventUsage *usage)
 {
 	WaitEventUsageEntry *entries;
+	WaitEventUsageEntry *old_entries = NULL;
 	dsa_pointer entries_dsa;
 	Size		entries_size;
+	int			old_nentries = 0;
+	int			new_nentries = 0;
+	int			i = 0;
+	int			j = 0;

Assert(worker != NULL);
Assert(area != NULL);
Assert(usage != NULL);

+	worker->overflowed_calls += usage->overflowed_calls;
+	INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+
 	if (usage->nentries <= 0)
 		return;
-	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+	if (DsaPointerIsValid(worker->entries))
+	{
+		Assert(worker->nentries > 0);
+		old_nentries = worker->nentries;
+		old_entries = dsa_get_address(area, worker->entries);
+	}
+
+	entries_size = mul_size(sizeof(WaitEventUsageEntry),
+							(Size) old_nentries + (Size) usage->nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
-	memcpy(entries, usage->entries, entries_size);
+
+	while (i < old_nentries && j < usage->nentries)
+	{
+		WaitEventUsageEntry *entry = &entries[new_nentries];
+		uint32		old_info = old_entries[i].wait_event_info;
+		uint32		new_info = usage->entries[j].wait_event_info;
+
+		if (old_info < new_info)
+			*entry = old_entries[i++];
+		else if (old_info > new_info)
+			*entry = usage->entries[j++];
+		else
+		{
+			*entry = old_entries[i++];
+			entry->calls += usage->entries[j].calls;
+			INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+			j++;
+		}
+
+		new_nentries++;
+	}
+
+	while (i < old_nentries)
+		entries[new_nentries++] = old_entries[i++];
+	while (j < usage->nentries)
+		entries[new_nentries++] = usage->entries[j++];

if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
- worker->nentries = usage->nentries;
+ worker->nentries = new_nentries;
worker->entries = entries_dsa;
}

diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 3d5a10b44c4..61b418e8fd7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -39,6 +39,10 @@ static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 							  uint64 calls, const instr_time *elapsed);
+static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+									  const instr_time *elapsed);
+static int	WaitEventUsageFind(const WaitEventUsage *usage,
+							   uint32 wait_event_info, bool *found);

static uint32 local_my_wait_event_info;
@@ -48,7 +52,16 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;

 int			pgstat_wait_event_usage_depth = 0;
 static WaitEventUsage *pgstat_wait_event_usage = NULL;
+
+/*
+ * Top of the active executor node and query-level stacks.  Query-level wait
+ * accounting records each wait once in every active query-level collector.
+ * Per-node wait accounting records each wait in every active plan node,
+ * matching the inclusive timing semantics of EXPLAIN ANALYZE plan nodes.
+ */
 static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_node_stack = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_query_stack = NULL;
 static uint32 pgstat_wait_event_usage_current = 0;
 static instr_time pgstat_wait_event_usage_start;
@@ -387,24 +400,37 @@ pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
  *
  * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
  * It records waits observed through pgstat_report_wait_start/end in backend
- * local memory.  Nested collection is deliberately treated as part of the
- * outer collection for now; callers that want independent nested accounting
- * need a stack of WaitEventUsage contexts.
+ * local memory.  Nested top-level collectors are kept in a query-level stack;
+ * a wait is counted once in each active collector.
  */
 void
 pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 {
+	bool		first;
+
 	Assert(usage != NULL);
 	Assert(memcontext != NULL);
-	if (pgstat_wait_event_usage_depth++ == 0)
+	first = pgstat_wait_event_usage_depth == 0;
+	if (first)
 	{
-		pgstat_init_wait_event_usage(usage, memcontext);
-		pgstat_wait_event_usage = usage;
 		pgstat_wait_event_node_usage = NULL;
+		pgstat_wait_event_usage_node_stack = NULL;
+		pgstat_wait_event_usage_query_stack = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
+
+	pgstat_init_wait_event_usage(usage, memcontext);
+	usage->query_parent = pgstat_wait_event_usage;
+	/*
+	 * A nested EXPLAIN can error out while one of its plan nodes is active,
+	 * skipping the usual node-level restore.  Remember the outer node stack so
+	 * ending this collector can discard any leaked inner node frames.
+	 */
+	usage->saved_node_usage = pgstat_wait_event_node_usage;
+	pgstat_wait_event_usage = usage;
+	pgstat_wait_event_usage_depth++;
 }
 /*
@@ -415,14 +441,22 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
 {
 	Assert(usage != NULL);
 	Assert(pgstat_wait_event_usage_depth > 0);
+	Assert(pgstat_wait_event_usage == usage);
+
+	if (pgstat_wait_event_usage_current != 0)
+		pgstat_count_wait_event_end();
+
+	pgstat_wait_event_usage = usage->query_parent;
+	usage->query_parent = NULL;
+	pgstat_wait_event_node_usage = usage->saved_node_usage;
+	usage->saved_node_usage = NULL;
 	if (--pgstat_wait_event_usage_depth == 0)
 	{
-		if (pgstat_wait_event_usage_current != 0)
-			pgstat_count_wait_event_end();
-
 		pgstat_wait_event_usage = NULL;
 		pgstat_wait_event_node_usage = NULL;
+		pgstat_wait_event_usage_node_stack = NULL;
+		pgstat_wait_event_usage_query_stack = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
@@ -433,6 +467,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
 {
 	WaitEventUsage *previous = pgstat_wait_event_node_usage;
+	Assert(usage != NULL);
+
+	usage->active_parent = previous;
 	pgstat_wait_event_node_usage = usage;
 	return previous;
 }
@@ -440,6 +477,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
 void
 pgstat_restore_wait_event_usage(WaitEventUsage *usage)
 {
+	Assert(pgstat_wait_event_node_usage == NULL ||
+		   pgstat_wait_event_node_usage->active_parent == usage);
+
 	pgstat_wait_event_node_usage = usage;
 }

@@ -460,6 +500,8 @@ pgstat_count_wait_event_start(uint32 wait_event_info)
pgstat_count_wait_event_end();

 	pgstat_wait_event_usage_current = wait_event_info;
+	pgstat_wait_event_usage_node_stack = pgstat_wait_event_node_usage;
+	pgstat_wait_event_usage_query_stack = pgstat_wait_event_usage;
 	INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
 }

@@ -480,17 +522,24 @@ pgstat_count_wait_event_end(void)
elapsed = end;
INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);

-	WaitEventUsageAdd(pgstat_wait_event_usage,
-					  pgstat_wait_event_usage_current,
-					  1,
-					  &elapsed);
-	if (pgstat_wait_event_node_usage != NULL)
-		WaitEventUsageAdd(pgstat_wait_event_node_usage,
+	for (WaitEventUsage *query_usage = pgstat_wait_event_usage_query_stack;
+		 query_usage != NULL;
+		 query_usage = query_usage->query_parent)
+		WaitEventUsageAdd(query_usage,
+						  pgstat_wait_event_usage_current,
+						  1,
+						  &elapsed);
+	for (WaitEventUsage *node_usage = pgstat_wait_event_usage_node_stack;
+		 node_usage != NULL;
+		 node_usage = node_usage->active_parent)
+		WaitEventUsageAdd(node_usage,
 						  pgstat_wait_event_usage_current,
 						  1,
 						  &elapsed);
 	pgstat_wait_event_usage_current = 0;
+	pgstat_wait_event_usage_node_stack = NULL;
+	pgstat_wait_event_usage_query_stack = NULL;
 	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 }

@@ -509,56 +558,123 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}

+/*
+ * Find the existing entry, or the insertion position for a new entry.
+ *
+ * WaitEventUsage entries are kept sorted by wait_event_info so the hot
+ * wait-end path does not need a linear scan through all distinct wait events
+ * already seen by the query or plan node.
+ */
+static int
+WaitEventUsageFind(const WaitEventUsage *usage, uint32 wait_event_info,
+				   bool *found)
+{
+	int			low = 0;
+	int			high = usage->nentries;
+
+	while (low < high)
+	{
+		int			mid = low + (high - low) / 2;
+		uint32		entry_info = usage->entries[mid].wait_event_info;
+
+		if (entry_info < wait_event_info)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	*found = low < usage->nentries &&
+		usage->entries[low].wait_event_info == wait_event_info;
+	return low;
+}
+
 static void
 WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 				  uint64 calls, const instr_time *elapsed)
 {
-	WaitEventUsageEntry *entry = NULL;
+	bool		found;
+	int			idx;
+	WaitEventUsageEntry *entry;
-	for (int i = 0; i < usage->nentries; i++)
-	{
-		if (usage->entries[i].wait_event_info == wait_event_info)
-		{
-			entry = &usage->entries[i];
-			break;
-		}
-	}
+	idx = WaitEventUsageFind(usage, wait_event_info, &found);
-	if (entry == NULL)
+	if (!found)
 	{
 		if (usage->nentries >= usage->maxentries)
 		{
-			MemoryContext oldcontext;
 			int			newmaxentries;
+			Size		entries_size;
+			WaitEventUsageEntry *newentries;
 			if (usage->maxentries > 0)
+			{
+				if ((Size) usage->maxentries >
+					MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
+				{
+					WaitEventUsageAddOverflow(usage, calls, elapsed);
+					return;
+				}
+
 				newmaxentries = usage->maxentries * 2;
+			}
 			else
 				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
-			oldcontext = MemoryContextSwitchTo(usage->memcontext);
+			if ((Size) newmaxentries >
+				MaxAllocSize / sizeof(WaitEventUsageEntry))
+			{
+				WaitEventUsageAddOverflow(usage, calls, elapsed);
+				return;
+			}
+
+			entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
+			/*
+			 * Wait completion can happen in a critical section, so growth
+			 * must not throw ERROR.  If storage cannot be grown without
+			 * throwing, preserve total wait time in the overflow bucket.
+			 */
 			if (usage->entries)
-				usage->entries = repalloc_array(usage->entries,
-												WaitEventUsageEntry,
-												newmaxentries);
+				newentries = repalloc_extended(usage->entries, entries_size,
+											   MCXT_ALLOC_NO_OOM);
 			else
-				usage->entries = palloc_array(WaitEventUsageEntry,
-											  newmaxentries);
-			MemoryContextSwitchTo(oldcontext);
+				newentries = MemoryContextAllocExtended(usage->memcontext,
+														entries_size,
+														MCXT_ALLOC_NO_OOM);
+			if (newentries == NULL)
+			{
+				WaitEventUsageAddOverflow(usage, calls, elapsed);
+				return;
+			}

+ usage->entries = newentries;
usage->maxentries = newmaxentries;
}

-		entry = &usage->entries[usage->nentries++];
+		if (idx < usage->nentries)
+			memmove(&usage->entries[idx + 1], &usage->entries[idx],
+					sizeof(WaitEventUsageEntry) * (usage->nentries - idx));
+
+		entry = &usage->entries[idx];
+		usage->nentries++;
 		entry->wait_event_info = wait_event_info;
 		entry->calls = 0;
 		INSTR_TIME_SET_ZERO(entry->time);
 	}
+	else
+		entry = &usage->entries[idx];

entry->calls += calls;
INSTR_TIME_ADD(entry->time, *elapsed);
}

+static void
+WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+						  const instr_time *elapsed)
+{
+	usage->overflowed_calls += calls;
+	INSTR_TIME_ADD(usage->overflowed_time, *elapsed);
+}
+
 /* ----------
  * pgstat_get_wait_event_type() -
  *
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index db65d130fcb..b33490e7415 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4524,9 +4524,9 @@ match_previous_words(int pattern_id,
 		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
-						  "BUFFERS", "SERIALIZE", "WAL", "TIMING", "SUMMARY",
+						  "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
 						  "MEMORY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|TIMING|SUMMARY|MEMORY"))
+		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("SERIALIZE"))
 			COMPLETE_WITH("TEXT", "NONE", "BINARY");
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 64411288403..f14945cdb16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -25,9 +25,14 @@ typedef struct WaitEventUsageEntry
 typedef struct WaitEventUsage
 {
 	MemoryContext memcontext;
+	struct WaitEventUsage *active_parent; /* active plan-node stack link */
+	struct WaitEventUsage *query_parent;	/* active query-level stack link */
+	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
 	int			nentries;
 	int			maxentries;
 	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 } WaitEventUsage;
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index c086fa48d80..275292d6172 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -47,6 +47,21 @@ begin
     return data::jsonb;
 end;
 $$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+    data text := '';
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        data := data || ln;
+    end loop;
+    return data::jsonb;
+end;
+$$;
 -- Disable JIT, or we'll get different output on machines where that's been
 -- forced on
 set jit = off;
@@ -106,11 +121,11 @@ select explain_filter('explain (analyze, waits, costs off, summary off, timing o
  Result (actual rows=N.N loops=N)
    Wait Events:
      Timeout:PgSleep calls=N time=N.N ms
- Wait Events:
+ Statement Wait Events:
    Timeout:PgSleep calls=N time=N.N ms
 (5 rows)
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
                                      ?column?                                     
 ----------------------------------------------------------------------------------
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
@@ -122,6 +137,52 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
+begin;
+create function pg_temp.nested_explain_waits() returns void
+  language plpgsql as
+$$
+begin
+  perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack.  The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+  language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+  language plpgsql as
+$$
+begin
+  begin
+    perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+  exception when division_by_zero then
+    null;
+  end;
+  perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -130,7 +191,7 @@ set local max_parallel_workers_per_gather = 1;
 select jsonb_path_query_first(
   explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
                          select pg_temp.parallel_pg_sleep(0.01)
-                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+                         from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
                               jsonb_path_query_first                              
@@ -149,6 +210,53 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
+rollback;
+begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic.  The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan.  The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+  language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+  select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from
+                           (select count(unique1) from tenk1
+                            where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+                           right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+  select jsonb_path_query_first(plan,
+                                '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+  from plan_json
+),
+pgsleep_wait as (
+  select node,
+         jsonb_path_query_first(node,
+                                '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+  from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+ parallel rescan waits accumulated 
+-----------------------------------
+ t
+(1 row)
+
 rollback;
 begin;
 -- This test deliberately creates a Bitmap Index Scan runtime-key wait.
@@ -157,6 +265,9 @@ begin;
 -- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
 -- The planner GUCs below are likewise test-only scaffolding to make the
 -- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
 create function pg_temp.explain_waits_sleep_int(int) returns int
   language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
 create temp table explain_waits_bitmap (a int);
@@ -176,6 +287,17 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index bd196f248db..9f50ef189b0 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -51,6 +51,22 @@ begin
 end;
 $$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+    data text := '';
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        data := data || ln;
+    end loop;
+    return data::jsonb;
+end;
+$$;
+
 -- Disable JIT, or we'll get different output on machines where that's been
 -- forced on
 set jit = off;
@@ -70,9 +86,40 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
 begin;
+create function pg_temp.nested_explain_waits() returns void
+  language plpgsql as
+$$
+begin
+  perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack.  The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+  language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+  language plpgsql as
+$$
+begin
+  begin
+    perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+  exception when division_by_zero then
+    null;
+  end;
+  perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
 set local debug_parallel_query = on;
@@ -80,7 +127,7 @@ set local max_parallel_workers_per_gather = 1;
 select jsonb_path_query_first(
   explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
                          select pg_temp.parallel_pg_sleep(0.01)
-                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+                         from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
 select jsonb_path_query_first(
@@ -91,12 +138,57 @@ select jsonb_path_query_first(
 );
 rollback;
 begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic.  The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan.  The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+  language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+  select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from
+                           (select count(unique1) from tenk1
+                            where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+                           right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+  select jsonb_path_query_first(plan,
+                                '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+  from plan_json
+),
+pgsleep_wait as (
+  select node,
+         jsonb_path_query_first(node,
+                                '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+  from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+rollback;
+begin;
 -- This test deliberately creates a Bitmap Index Scan runtime-key wait.
 -- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
 -- expression be used as an index runtime key, while PL/pgSQL prevents SQL
 -- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
 -- The planner GUCs below are likewise test-only scaffolding to make the
 -- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
 create function pg_temp.explain_waits_sleep_int(int) returns int
   language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
 create temp table explain_waits_bitmap (a int);
@@ -111,6 +203,12 @@ select jsonb_path_query_first(
                          where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
   '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
 );
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
 rollback;
 explain (waits) select 1;

--
2.52.0

#6Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 5/7] Harden EXPLAIN WAITS accumulator handling

---
doc/src/sgml/ref/explain.sgml | 15 ++--
src/backend/commands/explain.c | 2 +-
src/backend/executor/execProcnode.c | 11 +--
src/backend/utils/activity/wait_event.c | 100 ++++++++++--------------
src/include/utils/wait_event.h | 10 +--
5 files changed, 56 insertions(+), 82 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index d699b215120..7fa4b1cd955 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -326,13 +326,14 @@ ROLLBACK;
      </para>
      <para>
-      If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
-      event storage without risking an error while a wait is ending, waits
-      whose exact event identifier could not be stored are accumulated in an
-      <literal>Unrecorded Wait Event Calls</literal> counter and
-      <literal>Unrecorded Wait Event Time</literal> total.  This is a
-      reporting fallback under memory pressure, not a wait event emitted by
-      server instrumentation.
+      Each statement and plan-node accumulator preallocates storage for up to
+      <literal>64</literal> distinct wait event identifiers.  This bound
+      avoids memory allocation while a wait is ending.  If more distinct wait
+      event identifiers are observed, waits whose exact event identifier could
+      not be stored are accumulated in an <literal>Unrecorded Wait Event
+      Calls</literal> counter and <literal>Unrecorded Wait Event Time</literal>
+      total.  This is a reporting fallback, not a wait event emitted by server
+      instrumentation.
      </para>
      <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9c198f8e599..ee69d723cd8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -4578,7 +4578,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 	else
 		entries = NULL;
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainIndentText(es);
 		appendStringInfo(es->str, "%s:\n", labelname);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 081855b3fed..deee14839f2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -417,15 +417,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 		result->instrument = InstrAllocNode(estate->es_instrument,
 											result->async_capable);
 	if (estate->es_instrument & INSTRUMENT_WAITS)
-	{
-		MemoryContext oldcontext;
-
-		oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
-		result->wait_event_usage = palloc_object(WaitEventUsage);
-		pgstat_init_wait_event_usage(result->wait_event_usage,
-									 estate->es_query_cxt);
-		MemoryContextSwitchTo(oldcontext);
-	}
+		result->wait_event_usage =
+			pgstat_create_wait_event_usage(estate->es_query_cxt);
 	return result;
 }
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 61b418e8fd7..67980cc0a3b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,7 +27,6 @@
 #include "storage/shmem.h"
 #include "storage/subsystems.h"
 #include "storage/spin.h"
-#include "utils/memutils.h"
 #include "utils/wait_event.h"
@@ -43,15 +42,25 @@ static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
 									  const instr_time *elapsed);
 static int	WaitEventUsageFind(const WaitEventUsage *usage,
 							   uint32 wait_event_info, bool *found);
+static void WaitEventUsageInit(WaitEventUsage *usage,
+							   MemoryContext memcontext);

static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;

-#define WAIT_EVENT_USAGE_INITIAL_EVENTS	16
+/*
+ * Hardcoded limit: each EXPLAIN WAITS statement-level or plan-node accumulator
+ * can record this many distinct wait event identities without allocating while
+ * waits are ending.  Additional distinct wait identities are accounted for in
+ * the overflow bucket.
+ */
+#define WAIT_EVENT_USAGE_MAX_EVENTS		64
-int			pgstat_wait_event_usage_depth = 0;
+/* Fast-path flag exported for inline pgstat_report_wait_start/end(). */
+bool		pgstat_wait_event_usage_active = false;
 static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static int	pgstat_wait_event_usage_depth = 0;

/*
* Top of the active executor node and query-level stacks. Query-level wait
@@ -373,26 +382,36 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}

+/*
+ * Allocate and initialize a wait event usage accumulator.
+ */
+WaitEventUsage *
+pgstat_create_wait_event_usage(MemoryContext memcontext)
+{
+	WaitEventUsage *usage;
+
+	Assert(memcontext != NULL);
+
+	usage = MemoryContextAlloc(memcontext, sizeof(WaitEventUsage));
+	WaitEventUsageInit(usage, memcontext);
+	return usage;
+}
+
 /*
  * Initialize a wait event usage accumulator.
  */
-void
-pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+static void
+WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
 {
 	Assert(usage != NULL);
 	Assert(memcontext != NULL);

memset(usage, 0, sizeof(WaitEventUsage));

-	/*
-	 * Wait events may end inside critical sections, for example while
-	 * performing synchronous I/O.  Keep usage entries in a dedicated context
-	 * where the memory manager permits that accounting path to grow.
-	 */
-	usage->memcontext = AllocSetContextCreate(memcontext,
-											  "Wait Event Usage",
-											  ALLOCSET_SMALL_SIZES);
-	MemoryContextAllowInCriticalSection(usage->memcontext, true);
+	usage->entries = MemoryContextAlloc(memcontext,
+										sizeof(WaitEventUsageEntry) *
+										WAIT_EVENT_USAGE_MAX_EVENTS);
+	usage->maxentries = WAIT_EVENT_USAGE_MAX_EVENTS;
 }

/*
@@ -421,7 +440,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}

-	pgstat_init_wait_event_usage(usage, memcontext);
+	WaitEventUsageInit(usage, memcontext);
 	usage->query_parent = pgstat_wait_event_usage;
 	/*
 	 * A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -431,6 +450,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 	usage->saved_node_usage = pgstat_wait_event_node_usage;
 	pgstat_wait_event_usage = usage;
 	pgstat_wait_event_usage_depth++;
+	pgstat_wait_event_usage_active = true;
 }

/*
@@ -453,6 +473,7 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)

 	if (--pgstat_wait_event_usage_depth == 0)
 	{
+		pgstat_wait_event_usage_active = false;
 		pgstat_wait_event_usage = NULL;
 		pgstat_wait_event_node_usage = NULL;
 		pgstat_wait_event_usage_node_stack = NULL;
@@ -602,52 +623,13 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 	{
 		if (usage->nentries >= usage->maxentries)
 		{
-			int			newmaxentries;
-			Size		entries_size;
-			WaitEventUsageEntry *newentries;
-
-			if (usage->maxentries > 0)
-			{
-				if ((Size) usage->maxentries >
-					MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
-				{
-					WaitEventUsageAddOverflow(usage, calls, elapsed);
-					return;
-				}
-
-				newmaxentries = usage->maxentries * 2;
-			}
-			else
-				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
-
-			if ((Size) newmaxentries >
-				MaxAllocSize / sizeof(WaitEventUsageEntry))
-			{
-				WaitEventUsageAddOverflow(usage, calls, elapsed);
-				return;
-			}
-
-			entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
 			/*
-			 * Wait completion can happen in a critical section, so growth
-			 * must not throw ERROR.  If storage cannot be grown without
-			 * throwing, preserve total wait time in the overflow bucket.
+			 * Wait-end accounting must not allocate: it can run in a critical
+			 * section.  Preserve total calls/time without the exact event
+			 * identity once preallocated storage is full.
 			 */
-			if (usage->entries)
-				newentries = repalloc_extended(usage->entries, entries_size,
-											   MCXT_ALLOC_NO_OOM);
-			else
-				newentries = MemoryContextAllocExtended(usage->memcontext,
-														entries_size,
-														MCXT_ALLOC_NO_OOM);
-			if (newentries == NULL)
-			{
-				WaitEventUsageAddOverflow(usage, calls, elapsed);
-				return;
-			}
-
-			usage->entries = newentries;
-			usage->maxentries = newmaxentries;
+			WaitEventUsageAddOverflow(usage, calls, elapsed);
+			return;
 		}
 		if (idx < usage->nentries)
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f14945cdb16..67497790307 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -24,7 +24,6 @@ typedef struct WaitEventUsageEntry
 typedef struct WaitEventUsage
 {
-	MemoryContext memcontext;
 	struct WaitEventUsage *active_parent; /* active plan-node stack link */
 	struct WaitEventUsage *query_parent;	/* active query-level stack link */
 	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
@@ -41,8 +40,7 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
-extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
-										 MemoryContext memcontext);
+extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
@@ -55,7 +53,7 @@ extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
 extern PGDLLIMPORT uint32 *my_wait_event_info;
-extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
+extern PGDLLIMPORT bool pgstat_wait_event_usage_active;
 /*
@@ -101,7 +99,7 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
 static inline void
 pgstat_report_wait_start(uint32 wait_event_info)
 {
-	if (pgstat_wait_event_usage_depth > 0)
+	if (unlikely(pgstat_wait_event_usage_active))
 		pgstat_count_wait_event_start(wait_event_info);
 	/*
@@ -120,7 +118,7 @@ pgstat_report_wait_start(uint32 wait_event_info)
 static inline void
 pgstat_report_wait_end(void)
 {
-	if (pgstat_wait_event_usage_depth > 0)
+	if (unlikely(pgstat_wait_event_usage_active))
 		pgstat_count_wait_event_end();

/* see pgstat_report_wait_start() */
--
2.52.0

#7Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 6/7] Hide EXPLAIN WAITS accumulator internals

---
src/backend/commands/explain.c | 45 ++++++++++--------
src/backend/executor/execParallel.c | 44 +++++++++--------
src/backend/utils/activity/wait_event.c | 63 +++++++++++++++++++++++--
src/include/utils/wait_event.h | 32 ++++++++-----
4 files changed, 129 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee69d723cd8..0e2ec510fee 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -514,7 +514,6 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	int			eflags;
 	int			instrument_option = 0;
 	SerializeMetrics serializeMetrics = {0};
-	WaitEventUsage waitEventUsage;
 	WaitEventUsage *waitEventUsagePtr = NULL;

Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -593,9 +592,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,

 		if (es->waits)
 		{
-			waitEventUsagePtr = &waitEventUsage;
-			pgstat_begin_wait_event_usage(waitEventUsagePtr,
-										  queryDesc->estate->es_query_cxt);
+			waitEventUsagePtr =
+				pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
 			queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
 		}
@@ -4559,20 +4557,29 @@ static void
 show_wait_event_usage(ExplainState *es, const char *labelname,
 					  const WaitEventUsage *usage)
 {
+	const WaitEventUsageEntry *usage_entries;
 	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
+	int			nentries;

if (usage == NULL)
return;

-	if (usage->nentries == 0 && usage->overflowed_calls == 0)
+	if (pgstat_wait_event_usage_is_empty(usage))
 		return;
-	if (usage->nentries > 0)
+	nentries = pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+	pgstat_get_wait_event_usage_overflow(usage,
+										 &overflowed_calls,
+										 &overflowed_time);
+
+	if (nentries > 0)
 	{
-		entries = palloc_array(WaitEventUsageEntry, usage->nentries);
-		memcpy(entries, usage->entries,
-			   sizeof(WaitEventUsageEntry) * usage->nentries);
-		qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+		entries = palloc_array(WaitEventUsageEntry, nentries);
+		memcpy(entries, usage_entries,
+			   sizeof(WaitEventUsageEntry) * nentries);
+		qsort(entries, nentries, sizeof(WaitEventUsageEntry),
 			  wait_event_usage_cmp);
 	}
 	else
@@ -4584,7 +4591,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 		appendStringInfo(es->str, "%s:\n", labelname);
 		es->indent++;
-		for (int i = 0; i < usage->nentries; i++)
+		for (int i = 0; i < nentries; i++)
 		{
 			const char *event_type;
 			const char *event_name;
@@ -4600,24 +4607,24 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 							 INSTR_TIME_GET_MILLISEC(entries[i].time));
 		}
-		if (usage->overflowed_calls > 0)
+		if (overflowed_calls > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
 							 "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
-							 usage->overflowed_calls,
-							 INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+							 overflowed_calls,
+							 INSTR_TIME_GET_MILLISEC(overflowed_time));
 		}

es->indent--;
}
else
{
- if (usage->nentries > 0)
+ if (nentries > 0)
{
ExplainOpenGroup("Wait-Events", labelname, false, es);

-			for (int i = 0; i < usage->nentries; i++)
+			for (int i = 0; i < nentries; i++)
 			{
 				const char *event_type;
 				const char *event_name;
@@ -4642,16 +4649,16 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 			ExplainCloseGroup("Wait-Events", labelname, false, es);
 		}
-		if (usage->overflowed_calls > 0)
+		if (overflowed_calls > 0)
 		{
 			/*
 			 * This is not a wait event identity, so keep it outside the
 			 * Wait Events array in structured output.
 			 */
 			ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
-									usage->overflowed_calls, es);
+									overflowed_calls, es);
 			ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
-								 INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+								 INSTR_TIME_GET_MILLISEC(overflowed_time),
 								 3, es);
 		}
 	}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 520b4b8484f..dcd06c718c8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1352,8 +1352,9 @@ ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
 	if (worker->overflowed_calls > 0)
 	{
-		usage->overflowed_calls += worker->overflowed_calls;
-		INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+		pgstat_accumulate_wait_event_usage_overflow(usage,
+													worker->overflowed_calls,
+													&worker->overflowed_time);
 		worker->overflowed_calls = 0;
 		INSTR_TIME_SET_ZERO(worker->overflowed_time);
 	}
@@ -1377,11 +1378,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 									   dsa_area *area,
 									   const WaitEventUsage *usage)
 {
+	const WaitEventUsageEntry *usage_entries;
 	WaitEventUsageEntry *entries;
 	WaitEventUsageEntry *old_entries = NULL;
 	dsa_pointer entries_dsa;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 	Size		entries_size;
 	int			old_nentries = 0;
+	int			usage_nentries;
 	int			new_nentries = 0;
 	int			i = 0;
 	int			j = 0;
@@ -1390,10 +1395,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 	Assert(area != NULL);
 	Assert(usage != NULL);
-	worker->overflowed_calls += usage->overflowed_calls;
-	INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+	usage_nentries =
+		pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+	pgstat_get_wait_event_usage_overflow(usage,
+										 &overflowed_calls,
+										 &overflowed_time);
+	worker->overflowed_calls += overflowed_calls;
+	INSTR_TIME_ADD(worker->overflowed_time, overflowed_time);
-	if (usage->nentries <= 0)
+	if (usage_nentries <= 0)
 		return;

if (DsaPointerIsValid(worker->entries))
@@ -1404,25 +1414,25 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
}

 	entries_size = mul_size(sizeof(WaitEventUsageEntry),
-							(Size) old_nentries + (Size) usage->nentries);
+							(Size) old_nentries + (Size) usage_nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
-	while (i < old_nentries && j < usage->nentries)
+	while (i < old_nentries && j < usage_nentries)
 	{
 		WaitEventUsageEntry *entry = &entries[new_nentries];
 		uint32		old_info = old_entries[i].wait_event_info;
-		uint32		new_info = usage->entries[j].wait_event_info;
+		uint32		new_info = usage_entries[j].wait_event_info;
 		if (old_info < new_info)
 			*entry = old_entries[i++];
 		else if (old_info > new_info)
-			*entry = usage->entries[j++];
+			*entry = usage_entries[j++];
 		else
 		{
 			*entry = old_entries[i++];
-			entry->calls += usage->entries[j].calls;
-			INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+			entry->calls += usage_entries[j].calls;
+			INSTR_TIME_ADD(entry->time, usage_entries[j].time);
 			j++;
 		}

@@ -1431,8 +1441,8 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,

 	while (i < old_nentries)
 		entries[new_nentries++] = old_entries[i++];
-	while (j < usage->nentries)
-		entries[new_nentries++] = usage->entries[j++];
+	while (j < usage_nentries)
+		entries[new_nentries++] = usage_entries[j++];

if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
@@ -1781,7 +1791,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
SharedJitInstrumentation *jit_instrumentation;
- WaitEventUsage waitEventUsage;
WaitEventUsage *waitEventUsagePtr = NULL;
int instrument_options = 0;
void *area_space;
@@ -1841,11 +1850,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
InstrStartParallelQuery();

 	if (wait_event_usage != NULL)
-	{
-		waitEventUsagePtr = &waitEventUsage;
-		pgstat_begin_wait_event_usage(waitEventUsagePtr,
-									  queryDesc->estate->es_query_cxt);
-	}
+		waitEventUsagePtr =
+			pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 67980cc0a3b..9719e38729e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,6 +36,17 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
+struct WaitEventUsage
+{
+	struct WaitEventUsage *active_parent; /* active plan-node stack link */
+	struct WaitEventUsage *query_parent;	/* active query-level stack link */
+	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
+	int			nentries;
+	int			maxentries;
+	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
+};
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 							  uint64 calls, const instr_time *elapsed);
 static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
@@ -422,12 +433,12 @@ WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
  * local memory.  Nested top-level collectors are kept in a query-level stack;
  * a wait is counted once in each active collector.
  */
-void
-pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+WaitEventUsage *
+pgstat_begin_wait_event_usage(MemoryContext memcontext)
 {
+	WaitEventUsage *usage;
 	bool		first;

- Assert(usage != NULL);
Assert(memcontext != NULL);

first = pgstat_wait_event_usage_depth == 0;
@@ -440,7 +451,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}

-	WaitEventUsageInit(usage, memcontext);
+	usage = pgstat_create_wait_event_usage(memcontext);
 	usage->query_parent = pgstat_wait_event_usage;
 	/*
 	 * A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -451,6 +462,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 	pgstat_wait_event_usage = usage;
 	pgstat_wait_event_usage_depth++;
 	pgstat_wait_event_usage_active = true;
+	return usage;
 }

/*
@@ -579,6 +591,49 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}

+void
+pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+											uint64 calls,
+											const instr_time *elapsed)
+{
+	Assert(usage != NULL);
+	Assert(elapsed != NULL);
+
+	WaitEventUsageAddOverflow(usage, calls, elapsed);
+}
+
+bool
+pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage)
+{
+	Assert(usage != NULL);
+
+	return usage->nentries == 0 && usage->overflowed_calls == 0;
+}
+
+int
+pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+									const WaitEventUsageEntry **entries)
+{
+	Assert(usage != NULL);
+	Assert(entries != NULL);
+
+	*entries = usage->entries;
+	return usage->nentries;
+}
+
+void
+pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+									 uint64 *calls,
+									 instr_time *elapsed)
+{
+	Assert(usage != NULL);
+	Assert(calls != NULL);
+	Assert(elapsed != NULL);
+
+	*calls = usage->overflowed_calls;
+	*elapsed = usage->overflowed_time;
+}
+
 /*
  * Find the existing entry, or the insertion position for a new entry.
  *
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 67497790307..19763cfcae5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -15,6 +15,12 @@
 #include "utils/palloc.h"
 #include "utils/wait_event_types.h"
+/*
+ * EXPLAIN wait event accounting support.  WaitEventUsage is intentionally
+ * opaque outside wait_event.c; callers should allocate, accumulate, and read
+ * it through the functions below.  WaitEventUsageEntry is the reportable
+ * tuple copied to EXPLAIN output and parallel-worker storage.
+ */
 typedef struct WaitEventUsageEntry
 {
 	uint32		wait_event_info;
@@ -22,17 +28,7 @@ typedef struct WaitEventUsageEntry
 	instr_time	time;
 } WaitEventUsageEntry;
-typedef struct WaitEventUsage
-{
-	struct WaitEventUsage *active_parent; /* active plan-node stack link */
-	struct WaitEventUsage *query_parent;	/* active query-level stack link */
-	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
-	int			nentries;
-	int			maxentries;
-	WaitEventUsageEntry *entries;
-	uint64		overflowed_calls;
-	instr_time	overflowed_time;
-} WaitEventUsage;
+typedef struct WaitEventUsage WaitEventUsage;
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
 extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
@@ -40,13 +36,23 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+
+/* EXPLAIN wait event accounting. */
 extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
-extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
-										  MemoryContext memcontext);
+extern WaitEventUsage *pgstat_begin_wait_event_usage(MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 											   const WaitEventUsageEntry *entries,
 											   int nentries);
+extern void pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+														uint64 calls,
+														const instr_time *elapsed);
+extern bool pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage);
+extern int pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+											   const WaitEventUsageEntry **entries);
+extern void pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+												 uint64 *calls,
+												 instr_time *elapsed);
 extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
-- 
2.52.0
#8Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#1)
[RFC PATCH v0 7/7] Keep EXPLAIN option completion current

---
src/bin/psql/tab-complete.in.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b33490e7415..45712f4ac12 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4525,8 +4525,8 @@ match_previous_words(int pattern_id,
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
 						  "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
-						  "MEMORY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
+						  "MEMORY", "IO", "FORMAT");
+		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY|IO"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("SERIALIZE"))
 			COMPLETE_WITH("TEXT", "NONE", "BINARY");
-- 
2.52.0
#9Michael Paquier
michael@paquier.xyz
In reply to: Ilmar Yunusov (#1)
Re: [RFC PATCH v0 0/7] Add EXPLAIN ANALYZE wait event reporting

On Sat, May 09, 2026 at 04:22:30AM +0500, Ilmar Yunusov wrote:

This RFC prototype adds `EXPLAIN (ANALYZE, WAITS)`, which reports
completed wait intervals observed through `pgstat_report_wait_start/end()`.
The option is named `WAITS` in this RFC to match the short style of
`BUFFERS`, `WAL`, `IO`, and `MEMORY`. I am not attached to the exact name;
`WAIT_EVENTS` may be clearer but is more verbose.

Rather than sending one email for each patch sent, the usual practice
on the Postgres community lists to send a series of patches in a
single email.
--
Michael

#10Ilmar Yunusov
tanswis42@gmail.com
In reply to: Michael Paquier (#9)
[RFC PATCH v1] Add EXPLAIN ANALYZE wait event reporting

Following Michael Paquier's note about PostgreSQL list practice, this v1
resends the same patch content as v0 as a single email with the patch series
attached.

No code changes from v0.

v0 thread:
/messages/by-id/cover.1778280923.git.tanswis42@gmail.com

This RFC prototype adds EXPLAIN (ANALYZE, WAITS), which reports completed
wait
intervals observed through pgstat_report_wait_start/end().

Statement-level wait events are counted once per active statement collector.
Plan-node wait events use inclusive attribution, similar to EXPLAIN ANALYZE
timing: a completed wait is attributed to every active plan node captured
when
the wait begins. Per-node wait times therefore must not be summed across
plan
nodes.

The patch series is split as follows:

1. Add statement-level EXPLAIN WAITS reporting.
2. Aggregate statement-level EXPLAIN WAITS from parallel workers.
3. Attribute EXPLAIN WAITS to plan nodes.
4. Refine attribution semantics, documentation, and regression coverage.
5. Harden accumulator handling with fixed-size, allocation-free storage.
6. Hide accumulator internals from public headers.
7. Keep psql EXPLAIN option completion current.

The main review questions are:

- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.

Regards,
Ilmar

Attachments:

0001-Add-EXPLAIN-WAITS-statement-reporting.patchapplication/octet-stream; name=0001-Add-EXPLAIN-WAITS-statement-reporting.patchDownload+349-4
0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchapplication/octet-stream; name=0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchDownload+190-4
0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchapplication/octet-stream; name=0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchDownload+333-28
0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchapplication/octet-stream; name=0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchDownload+574-105
0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchapplication/octet-stream; name=0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchDownload+56-83
0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchapplication/octet-stream; name=0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchDownload+129-56
0007-Keep-EXPLAIN-option-completion-current.patchapplication/octet-stream; name=0007-Keep-EXPLAIN-option-completion-current.patchDownload+2-3
#11Ilmar Yunusov
tanswis42@gmail.com
In reply to: Ilmar Yunusov (#10)
[RFC PATCH v2] Add EXPLAIN ANALYZE wait event reporting

This v2 keeps the same RFC feature scope as v1 and changes only regression
test coverage/stability.

v1 thread:
/messages/by-id/CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com

v0 thread:
/messages/by-id/cover.1778280923.git.tanswis42@gmail.com

The CFBot FreeBSD run showed that the regression tests assumed a too narrow
statement-level wait list. EXPLAIN WAITS can validly observe additional
statement-level waits around the measured query, for example
parallel-executor
IPC waits or DSM allocation waits. Those are valid observed waits, not an
accounting bug.

Changes in v2:

1. Make the text-output test check for the required Wait Events and
Statement
Wait Events lines, instead of expecting the full statement-level wait
list
to contain only Timeout:PgSleep.
2. Make JSON tests find Timeout:PgSleep by JSONPath instead of assuming it
is
the first wait event array element.
3. Disable debug_parallel_query and default gather workers in the explain
regression test before serial EXPLAIN checks.
4. Disable debug_parallel_query and gather workers in the bitmap runtime-key
attribution test.
5. Remove the plain-regression assertion for rescanned parallel worker wait
aggregation for now. Worker availability and the exact parallel plan
shape
are not deterministic enough for that test under the parallel regression
harness. The accounting behavior is still implemented, but this specific
edge should come back as a more isolated test if we can make it
deterministic
enough for CFBot.

There are no accounting-code changes from v1.

The main RFC questions are unchanged:

- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.

Local verification:

make -C src/test/regress check TESTS=explain
All 245 tests passed.

git diff --check
passed.

Regards,
Ilmar

Attachments:

0001-Add-EXPLAIN-WAITS-statement-reporting.patchapplication/octet-stream; name=0001-Add-EXPLAIN-WAITS-statement-reporting.patchDownload+349-4
0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchapplication/octet-stream; name=0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchDownload+574-105
0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchapplication/octet-stream; name=0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchDownload+333-28
0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchapplication/octet-stream; name=0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchDownload+190-4
0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchapplication/octet-stream; name=0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchDownload+56-83
0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchapplication/octet-stream; name=0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchDownload+129-56
0007-Keep-EXPLAIN-option-completion-current.patchapplication/octet-stream; name=0007-Keep-EXPLAIN-option-completion-current.patchDownload+2-3
0008-Stabilize-EXPLAIN-WAITS-regression-tests.patchapplication/octet-stream; name=0008-Stabilize-EXPLAIN-WAITS-regression-tests.patchDownload+95-97