[RFC PATCH v0 0/7] Add EXPLAIN ANALYZE wait event reporting
This RFC prototype adds `EXPLAIN (ANALYZE, WAITS)`, which reports
completed wait intervals observed through `pgstat_report_wait_start/end()`.
The option is named `WAITS` in this RFC to match the short style of
`BUFFERS`, `WAL`, `IO`, and `MEMORY`. I am not attached to the exact name;
`WAIT_EVENTS` may be clearer but is more verbose.
PostgreSQL already exposes a backend's current wait event through
pg_stat_activity. This patch explores making the same wait event
instrumentation useful in EXPLAIN ANALYZE by collecting per-statement and
per-plan-node wait event usage while a statement executes.
Statement-level output is reported as `Statement Wait Events`. It counts
each completed wait once per active statement-level collector and includes
parallel worker waits. Nested EXPLAIN ANALYZE WAITS collectors maintain
separate statement-level summaries; a wait is counted once in each active
collector.
Plan-node output is reported as `Wait Events`. Node-level attribution is
intentionally inclusive, matching EXPLAIN ANALYZE node timing: a wait is
attributed to every active plan node captured when the wait begins. This
means parent and child nodes can show the same wait, and node-level wait
times must not be summed to compute a statement total.
The implementation keeps wait-end accounting allocation-free. Each
statement and plan-node accumulator preallocates storage for 64 distinct
wait event identities; additional distinct identities are accumulated in
`Unrecorded Wait Event Calls` and `Unrecorded Wait Event Time` without event
identity. The fixed bound is intended to make the wait-end path predictable
and safe in places where allocation would be undesirable. The overflow
bucket preserves total calls/time, but loses per-event identity. This is a
deliberate RFC point.
Patch layout:
1. add statement-level EXPLAIN WAITS reporting;
2. aggregate statement-level waits from parallel workers;
3. add plan-node wait attribution, including manual executor paths;
4. refine attribution semantics, docs, overflow output, and tests;
5. harden accumulator handling and keep wait-end allocation-free;
6. hide accumulator internals behind the wait-event accounting API;
7. update EXPLAIN option tab completion.
Important review questions:
- Is the `WAITS` option name and output shape acceptable, or should this be
`WAIT_EVENTS` / different labels?
- Is inclusive per-node attribution the right semantic for EXPLAIN?
- Is the fixed 64-entry accumulator plus explicit overflow bucket acceptable?
- Is the disabled hot-path overhead of checking an exported boolean in
pgstat_report_wait_start/end acceptable?
- Are the test scaffolding choices acceptable, especially planner GUCs and
pg_sleep wrappers used to force deterministic wait-attribution cases? The
tests use pg_sleep only to force a stable Timeout:PgSleep wait identity;
durations are normalized by the existing EXPLAIN test filters.
Local verification so far:
- `make -s -j4`
- `make -C doc/src/sgml check`
- `make -s -C src/bin/psql`
- `make -C src/test/regress check-tests TESTS='test_setup create_index explain'`
- `git diff --check`
The final diff of this 7-patch branch is identical to the development branch
`r314tive/pg-wait-explain-mvp`.
Local optimized macOS microbenchmarks are directional only. The current
synthetic C wait-loop run measured roughly 0.1-0.2 ns/wait disabled overhead
and about 30 ns/wait enabled accounting for a single active node. These
numbers are not intended as performance evidence for commit; they only served
as a local smoke check that the disabled path is plausibly small. I would
want repeated Linux, CPU-pinned numbers before drawing stronger conclusions.
Ilmar Yunusov (7):
Add EXPLAIN WAITS statement reporting
Aggregate EXPLAIN WAITS from parallel workers
Attribute EXPLAIN WAITS to plan nodes
Refine EXPLAIN WAITS attribution semantics
Harden EXPLAIN WAITS accumulator handling
Hide EXPLAIN WAITS accumulator internals
Keep EXPLAIN option completion current
doc/src/sgml/ref/explain.sgml | 61 ++++
src/backend/commands/explain.c | 172 +++++++++-
src/backend/commands/explain_state.c | 8 +
src/backend/executor/execAsync.c | 22 ++
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 295 ++++++++++++++++-
src/backend/executor/execProcnode.c | 24 +-
src/backend/executor/execUtils.c | 1 +
src/backend/executor/instrument.c | 7 +
src/backend/executor/nodeBitmapAnd.c | 7 +
src/backend/executor/nodeBitmapIndexscan.c | 7 +
src/backend/executor/nodeBitmapOr.c | 7 +
src/backend/executor/nodeHash.c | 7 +
src/backend/utils/activity/wait_event.c | 363 +++++++++++++++++++++
src/bin/psql/tab-complete.in.c | 6 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 1 +
src/include/nodes/execnodes.h | 3 +
src/include/utils/wait_event.h | 45 +++
src/test/regress/expected/explain.out | 202 ++++++++++++
src/test/regress/sql/explain.sql | 144 ++++++++
22 files changed, 1371 insertions(+), 15 deletions(-)
--
2.52.0
---
doc/src/sgml/ref/explain.sgml | 14 +++
src/backend/commands/explain.c | 129 +++++++++++++++++++-
src/backend/commands/explain_state.c | 8 ++
src/backend/utils/activity/wait_event.c | 149 ++++++++++++++++++++++++
src/include/commands/explain_state.h | 1 +
src/include/utils/wait_event.h | 29 +++++
src/test/regress/expected/explain.out | 17 +++
src/test/regress/sql/explain.sql | 5 +
8 files changed, 349 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index e95e19081e1..5460568af1d 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -47,6 +47,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
IO [ <replaceable class="parameter">boolean</replaceable> ]
+ WAITS [ <replaceable class="parameter">boolean</replaceable> ]
FORMAT { TEXT | XML | JSON | YAML }
</synopsis>
</refsynopsisdiv>
@@ -310,6 +311,19 @@ ROLLBACK;
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>WAITS</literal></term>
+ <listitem>
+ <para>
+ Include exact backend-local wait event timing observed during query
+ execution. Wait events are grouped by wait event type and wait event
+ name, and include the number of waits and total time spent waiting.
+ This parameter may only be used when <literal>ANALYZE</literal> is also
+ enabled. It defaults to <literal>FALSE</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>FORMAT</literal></term>
<listitem>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 112c17b0d64..0b7cc5c15c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -45,6 +45,7 @@
#include "utils/tuplesort.h"
#include "utils/tuplestore.h"
#include "utils/typcache.h"
+#include "utils/wait_event.h"
#include "utils/xml.h"
@@ -149,6 +150,9 @@ static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static int wait_event_usage_cmp(const void *a, const void *b);
+static void show_wait_event_usage(ExplainState *es,
+ const WaitEventUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
static void show_result_replacement_info(Result *result, ExplainState *es);
@@ -510,6 +514,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ WaitEventUsage waitEventUsage;
+ WaitEventUsage *waitEventUsagePtr = NULL;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -583,11 +589,27 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
else
dir = ForwardScanDirection;
+ if (es->waits)
+ {
+ waitEventUsagePtr = &waitEventUsage;
+ pgstat_begin_wait_event_usage(waitEventUsagePtr,
+ queryDesc->estate->es_query_cxt);
+ }
+
/* run the plan */
- ExecutorRun(queryDesc, dir, 0);
+ PG_TRY();
+ {
+ ExecutorRun(queryDesc, dir, 0);
- /* run cleanup too */
- ExecutorFinish(queryDesc);
+ /* run cleanup too */
+ ExecutorFinish(queryDesc);
+ }
+ PG_FINALLY();
+ {
+ if (waitEventUsagePtr)
+ pgstat_end_wait_event_usage(waitEventUsagePtr);
+ }
+ PG_END_TRY();
/* We can't run ExecutorEnd 'till we're done printing the stats... */
totaltime += elapsed_time(&starttime);
@@ -605,6 +627,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
+ if (waitEventUsagePtr)
+ show_wait_event_usage(es, waitEventUsagePtr);
+
/* Show buffer and/or memory usage in planning */
if (peek_buffer_usage(es, bufusage) || mem_counters)
{
@@ -4503,6 +4528,104 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
}
}
+static int
+wait_event_usage_cmp(const void *a, const void *b)
+{
+ const WaitEventUsageEntry *ea = (const WaitEventUsageEntry *) a;
+ const WaitEventUsageEntry *eb = (const WaitEventUsageEntry *) b;
+ int64 ta = INSTR_TIME_GET_MICROSEC(ea->time);
+ int64 tb = INSTR_TIME_GET_MICROSEC(eb->time);
+
+ if (ta < tb)
+ return 1;
+ if (ta > tb)
+ return -1;
+ if (ea->wait_event_info < eb->wait_event_info)
+ return -1;
+ if (ea->wait_event_info > eb->wait_event_info)
+ return 1;
+ return 0;
+}
+
+static void
+show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+{
+ WaitEventUsageEntry *entries;
+
+ if (usage == NULL)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+ return;
+
+ if (usage->nentries > 0)
+ {
+ entries = palloc_array(WaitEventUsageEntry, usage->nentries);
+ memcpy(entries, usage->entries,
+ sizeof(WaitEventUsageEntry) * usage->nentries);
+ qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+ wait_event_usage_cmp);
+ }
+ else
+ entries = NULL;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Wait Events:\n");
+ es->indent++;
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
+
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "%s:%s calls=%" PRIu64 " time=%0.3f ms\n",
+ event_type ? event_type : "Unknown",
+ event_name ? event_name : "unknown",
+ entries[i].calls,
+ INSTR_TIME_GET_MILLISEC(entries[i].time));
+ }
+
+ es->indent--;
+ }
+ else
+ {
+ ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
+
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainOpenGroup("Wait-Event", NULL, true, es);
+ ExplainPropertyText("Wait Event Type",
+ event_type ? event_type : "Unknown",
+ es);
+ ExplainPropertyText("Wait Event",
+ event_name ? event_name : "unknown",
+ es);
+ ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+ ExplainPropertyFloat("Time", "ms",
+ INSTR_TIME_GET_MILLISEC(entries[i].time),
+ 3, es);
+ ExplainCloseGroup("Wait-Event", NULL, true, es);
+ }
+
+ ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+ }
+
+ if (entries)
+ pfree(entries);
+}
+
/*
* Show memory usage details.
*/
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a0ee0a664be..2abcb64cc95 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -164,6 +164,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
}
else if (strcmp(opt->defname, "io") == 0)
es->io = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "waits") == 0)
+ es->waits = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -196,6 +198,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("EXPLAIN option %s requires ANALYZE", "IO")));
+ /* check that WAITS is used with EXPLAIN ANALYZE */
+ if (es->waits && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "WAITS")));
+
/* check that serialize is used with EXPLAIN ANALYZE */
if (es->serialize != EXPLAIN_SERIALIZE_NONE && !es->analyze)
ereport(ERROR,
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 95635c7f56c..60d37ccbb73 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,11 +36,20 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
+static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+ const instr_time *elapsed);
static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;
+#define WAIT_EVENT_USAGE_INITIAL_EVENTS 16
+
+int pgstat_wait_event_usage_depth = 0;
+static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static uint32 pgstat_wait_event_usage_current = 0;
+static instr_time pgstat_wait_event_usage_start;
+
#define WAIT_EVENT_CLASS_MASK 0xFF000000
#define WAIT_EVENT_ID_MASK 0x0000FFFF
@@ -349,6 +358,146 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Start collecting exact wait event timings in this backend.
+ *
+ * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
+ * It records waits observed through pgstat_report_wait_start/end in backend
+ * local memory. Nested collection is deliberately treated as part of the
+ * outer collection for now; callers that want independent nested accounting
+ * need a stack of WaitEventUsage contexts.
+ */
+void
+pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+ Assert(usage != NULL);
+ Assert(memcontext != NULL);
+
+ if (pgstat_wait_event_usage_depth++ == 0)
+ {
+ memset(usage, 0, sizeof(WaitEventUsage));
+ usage->memcontext = memcontext;
+ pgstat_wait_event_usage = usage;
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+ }
+}
+
+/*
+ * Stop collecting wait event timings.
+ */
+void
+pgstat_end_wait_event_usage(WaitEventUsage *usage)
+{
+ Assert(usage != NULL);
+ Assert(pgstat_wait_event_usage_depth > 0);
+
+ if (--pgstat_wait_event_usage_depth == 0)
+ {
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage = NULL;
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+ }
+}
+
+/*
+ * Record the beginning of a wait event for exact EXPLAIN-style accounting.
+ */
+void
+pgstat_count_wait_event_start(uint32 wait_event_info)
+{
+ if (pgstat_wait_event_usage == NULL)
+ return;
+
+ /*
+ * Waits are not expected to nest. If they do, finish the previous wait
+ * at the boundary so accounting remains internally consistent.
+ */
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage_current = wait_event_info;
+ INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
+}
+
+/*
+ * Record the end of the current wait event.
+ */
+void
+pgstat_count_wait_event_end(void)
+{
+ instr_time end;
+ instr_time elapsed;
+
+ if (pgstat_wait_event_usage == NULL ||
+ pgstat_wait_event_usage_current == 0)
+ return;
+
+ INSTR_TIME_SET_CURRENT(end);
+ elapsed = end;
+ INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
+
+ WaitEventUsageAdd(pgstat_wait_event_usage,
+ pgstat_wait_event_usage_current,
+ &elapsed);
+
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+}
+
+static void
+WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+ const instr_time *elapsed)
+{
+ WaitEventUsageEntry *entry = NULL;
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ if (usage->entries[i].wait_event_info == wait_event_info)
+ {
+ entry = &usage->entries[i];
+ break;
+ }
+ }
+
+ if (entry == NULL)
+ {
+ if (usage->nentries >= usage->maxentries)
+ {
+ MemoryContext oldcontext;
+ int newmaxentries;
+
+ if (usage->maxentries > 0)
+ newmaxentries = usage->maxentries * 2;
+ else
+ newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
+
+ oldcontext = MemoryContextSwitchTo(usage->memcontext);
+ if (usage->entries)
+ usage->entries = repalloc_array(usage->entries,
+ WaitEventUsageEntry,
+ newmaxentries);
+ else
+ usage->entries = palloc_array(WaitEventUsageEntry,
+ newmaxentries);
+ MemoryContextSwitchTo(oldcontext);
+
+ usage->maxentries = newmaxentries;
+ }
+
+ entry = &usage->entries[usage->nentries++];
+ entry->wait_event_info = wait_event_info;
+ entry->calls = 0;
+ INSTR_TIME_SET_ZERO(entry->time);
+ }
+
+ entry->calls++;
+ INSTR_TIME_ADD(entry->time, *elapsed);
+}
+
/* ----------
* pgstat_get_wait_event_type() -
*
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 97bc7ed49f6..4f0d0d495d4 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool io; /* print info about IO (prefetch, ...) */
+ bool waits; /* print wait event usage */
bool generic; /* generate a generic plan */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 86ee348220d..f7fab5736bb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -11,16 +11,39 @@
#define WAIT_EVENT_H
/* enums for wait events */
+#include "portability/instr_time.h"
+#include "utils/palloc.h"
#include "utils/wait_event_types.h"
+typedef struct WaitEventUsageEntry
+{
+ uint32 wait_event_info;
+ uint64 calls;
+ instr_time time;
+} WaitEventUsageEntry;
+
+typedef struct WaitEventUsage
+{
+ MemoryContext memcontext;
+ int nentries;
+ int maxentries;
+ WaitEventUsageEntry *entries;
+} WaitEventUsage;
+
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
+ MemoryContext memcontext);
+extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_count_wait_event_start(uint32 wait_event_info);
+extern void pgstat_count_wait_event_end(void);
extern PGDLLIMPORT uint32 *my_wait_event_info;
+extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
/*
@@ -66,6 +89,9 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
+ if (pgstat_wait_event_usage_depth > 0)
+ pgstat_count_wait_event_start(wait_event_info);
+
/*
* Since this is a four-byte field which is always read and written as
* four-bytes, updates are atomic.
@@ -82,6 +108,9 @@ pgstat_report_wait_start(uint32 wait_event_info)
static inline void
pgstat_report_wait_end(void)
{
+ if (pgstat_wait_event_usage_depth > 0)
+ pgstat_count_wait_event_end();
+
/* see pgstat_report_wait_start() */
*(volatile uint32 *) my_wait_event_info = 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 74a4d87801e..2c7a7e1d4c6 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -99,6 +99,23 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
(1 row)
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+ explain_filter
+---------------------------------------
+ Result (actual rows=N.N loops=N)
+ Wait Events:
+ Timeout:PgSleep calls=N time=N.N ms
+(3 rows)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+explain (waits) select 1;
+ERROR: EXPLAIN option WAITS requires ANALYZE
\a
select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
explain_filter
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 2f163c64bf6..fe025ddeac5 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -68,6 +68,11 @@ select explain_filter('explain (analyze, buffers off, verbose) select * from int
select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+explain (waits) select 1;
+
\a
select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
--
2.52.0
---
src/backend/commands/explain.c | 4 +
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 129 ++++++++++++++++++++++++
src/backend/executor/execUtils.c | 1 +
src/backend/utils/activity/wait_event.c | 22 +++-
src/include/executor/execParallel.h | 2 +
src/include/nodes/execnodes.h | 2 +
src/include/utils/wait_event.h | 3 +
src/test/regress/expected/explain.out | 17 ++++
src/test/regress/sql/explain.sql | 12 +++
10 files changed, 190 insertions(+), 3 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0b7cc5c15c6..9d7372f5415 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -594,6 +594,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
waitEventUsagePtr = &waitEventUsage;
pgstat_begin_wait_event_usage(waitEventUsagePtr,
queryDesc->estate->es_query_cxt);
+ queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
}
/* run the plan */
@@ -607,7 +608,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
PG_FINALLY();
{
if (waitEventUsagePtr)
+ {
pgstat_end_wait_event_usage(waitEventUsagePtr);
+ queryDesc->estate->es_wait_event_usage = NULL;
+ }
}
PG_END_TRY();
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b30f768680..86ab124c1c0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3066,6 +3066,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
/* es_trig_target_relations must NOT be copied */
rcestate->es_top_eflags = parentestate->es_top_eflags;
rcestate->es_instrument = parentestate->es_instrument;
+ rcestate->es_wait_event_usage = parentestate->es_wait_event_usage;
/* es_auxmodifytables must NOT be copied */
/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 81b87d82fab..8213565a708 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
#include "utils/dsa.h"
#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
+#include "utils/wait_event.h"
/*
* Magic numbers for parallel executor communication. We use constants
@@ -67,6 +68,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_WAIT_EVENT_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -114,6 +116,18 @@ struct SharedExecutorInstrumentation
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
+typedef struct SharedWaitEventUsageWorker
+{
+ int nentries;
+ dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
+struct SharedWaitEventUsage
+{
+ int num_workers;
+ SharedWaitEventUsageWorker worker_usage[FLEXIBLE_ARRAY_MEMBER];
+};
+
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
{
@@ -141,6 +155,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
ParallelContext *pcxt);
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation);
+static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage);
/* Helper function that runs in the parallel worker. */
static DestReceiver *ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc);
@@ -664,10 +682,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
WalUsage *walusage_space;
+ SharedWaitEventUsage *wait_event_usage = NULL;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
+ int wait_event_usage_len = 0;
int instrumentation_len = 0;
int jit_instrumentation_len = 0;
int instrument_offset = 0;
@@ -744,6 +764,16 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /* Estimate space for per-worker wait event usage metadata. */
+ if (estate->es_wait_event_usage != NULL)
+ {
+ wait_event_usage_len =
+ offsetof(SharedWaitEventUsage, worker_usage) +
+ mul_size(sizeof(SharedWaitEventUsageWorker), pcxt->nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, wait_event_usage_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -839,6 +869,21 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Allocate metadata for each worker's wait event usage, if requested. */
+ if (estate->es_wait_event_usage != NULL)
+ {
+ wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
+ wait_event_usage->num_workers = nworkers;
+ for (int i = 0; i < nworkers; i++)
+ {
+ wait_event_usage->worker_usage[i].nentries = 0;
+ wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
+ wait_event_usage);
+ pei->wait_event_usage = wait_event_usage;
+ }
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1213,6 +1258,68 @@ ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
}
+static void
+ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
+{
+ SharedWaitEventUsage *shared = pei->wait_event_usage;
+ WaitEventUsage *usage;
+
+ if (shared == NULL)
+ return;
+
+ usage = pei->planstate->state->es_wait_event_usage;
+ if (usage == NULL)
+ return;
+
+ for (int i = 0; i < shared->num_workers; i++)
+ {
+ SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
+ WaitEventUsageEntry *entries;
+
+ if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+ continue;
+
+ entries = dsa_get_address(pei->area, worker->entries);
+ pgstat_accumulate_wait_event_usage(usage,
+ entries,
+ worker->nentries);
+ dsa_free(pei->area, worker->entries);
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ }
+}
+
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage)
+{
+ SharedWaitEventUsageWorker *worker;
+ WaitEventUsageEntry *entries;
+ dsa_pointer entries_dsa;
+ Size entries_size;
+
+ Assert(shared != NULL);
+ Assert(area != NULL);
+ Assert(usage != NULL);
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber < shared->num_workers);
+
+ if (usage->nentries <= 0)
+ return;
+
+ worker = &shared->worker_usage[ParallelWorkerNumber];
+ entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+ entries_dsa = dsa_allocate(area, entries_size);
+ entries = dsa_get_address(area, entries_dsa);
+ memcpy(entries, usage->entries, entries_size);
+
+ if (DsaPointerIsValid(worker->entries))
+ dsa_free(area, worker->entries);
+ worker->nentries = usage->nentries;
+ worker->entries = entries_dsa;
+}
+
/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer/WAL usage.
@@ -1261,6 +1368,9 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ /* Accumulate wait event usage, if requested. */
+ ExecParallelRetrieveWaitEventUsage(pei);
+
pei->finished = true;
}
@@ -1516,10 +1626,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ SharedWaitEventUsage *wait_event_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
SharedJitInstrumentation *jit_instrumentation;
+ WaitEventUsage waitEventUsage;
+ WaitEventUsage *waitEventUsagePtr = NULL;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1535,6 +1648,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
true);
+ wait_event_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAIT_EVENT_USAGE, true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1576,6 +1690,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
*/
InstrStartParallelQuery();
+ if (wait_event_usage != NULL)
+ {
+ waitEventUsagePtr = &waitEventUsage;
+ pgstat_begin_wait_event_usage(waitEventUsagePtr,
+ queryDesc->estate->es_query_cxt);
+ }
+
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
* more tuples than that.
@@ -1587,6 +1708,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
+ if (waitEventUsagePtr != NULL)
+ {
+ pgstat_end_wait_event_usage(waitEventUsagePtr);
+ ExecParallelReportWaitEventUsage(wait_event_usage,
+ area,
+ waitEventUsagePtr);
+ }
+
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..80ea777632b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)
estate->es_top_eflags = 0;
estate->es_instrument = 0;
+ estate->es_wait_event_usage = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60d37ccbb73..eb01bc3d88c 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -37,7 +37,7 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
- const instr_time *elapsed);
+ uint64 calls, const instr_time *elapsed);
static uint32 local_my_wait_event_info;
@@ -442,15 +442,31 @@ pgstat_count_wait_event_end(void)
WaitEventUsageAdd(pgstat_wait_event_usage,
pgstat_wait_event_usage_current,
+ 1,
&elapsed);
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
+void
+pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+ const WaitEventUsageEntry *entries,
+ int nentries)
+{
+ Assert(usage != NULL);
+ Assert(nentries == 0 || entries != NULL);
+
+ for (int i = 0; i < nentries; i++)
+ WaitEventUsageAdd(usage,
+ entries[i].wait_event_info,
+ entries[i].calls,
+ &entries[i].time);
+}
+
static void
WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
- const instr_time *elapsed)
+ uint64 calls, const instr_time *elapsed)
{
WaitEventUsageEntry *entry = NULL;
@@ -494,7 +510,7 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
INSTR_TIME_SET_ZERO(entry->time);
}
- entry->calls++;
+ entry->calls += calls;
INSTR_TIME_ADD(entry->time, *elapsed);
}
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..71df2c2511c 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -20,6 +20,7 @@
#include "utils/dsa.h"
typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+typedef struct SharedWaitEventUsage SharedWaitEventUsage;
typedef struct ParallelExecutorInfo
{
@@ -27,6 +28,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
WalUsage *wal_usage; /* walusage area in DSM */
+ SharedWaitEventUsage *wait_event_usage; /* optional */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 13359180d25..781c8c79132 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -69,6 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
+typedef struct WaitEventUsage WaitEventUsage;
typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
@@ -754,6 +755,7 @@ typedef struct EState
int es_top_eflags; /* eflags passed to ExecutorStart */
int es_instrument; /* OR of InstrumentOption flags */
+ WaitEventUsage *es_wait_event_usage; /* EXPLAIN WAITS accumulator */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f7fab5736bb..63992137ee7 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -39,6 +39,9 @@ extern void pgstat_reset_wait_event_storage(void);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+ const WaitEventUsageEntry *entries,
+ int nentries);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 2c7a7e1d4c6..e3847e222be 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -114,6 +114,23 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+ language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
\a
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index fe025ddeac5..8821250bcef 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,18 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+ language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
explain (waits) select 1;
\a
--
2.52.0
---
src/backend/commands/explain.c | 6 +-
src/backend/executor/execAsync.c | 22 ++++
src/backend/executor/execParallel.c | 131 ++++++++++++++++++---
src/backend/executor/execProcnode.c | 31 ++++-
src/backend/executor/instrument.c | 7 ++
src/backend/executor/nodeBitmapAnd.c | 7 ++
src/backend/executor/nodeBitmapIndexscan.c | 7 ++
src/backend/executor/nodeBitmapOr.c | 7 ++
src/backend/executor/nodeHash.c | 7 ++
src/backend/utils/activity/wait_event.c | 49 +++++++-
src/include/executor/instrument.h | 1 +
src/include/nodes/execnodes.h | 1 +
src/include/utils/wait_event.h | 4 +
src/test/regress/expected/explain.out | 52 +++++++-
src/test/regress/sql/explain.sql | 29 +++++
15 files changed, 333 insertions(+), 28 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9d7372f5415..0575ae8368b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -530,6 +530,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_WAL;
if (es->io)
instrument_option |= INSTRUMENT_IO;
+ if (es->waits)
+ instrument_option |= INSTRUMENT_WAITS;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2332,6 +2334,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
+ if (es->waits)
+ show_wait_event_usage(es, planstate->wait_event_usage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4559,7 +4563,7 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
if (usage == NULL)
return;
- if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+ if (usage->nentries == 0)
return;
if (usage->nentries > 0)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index cf7ddbb01f4..e73ea55cf92 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -19,6 +19,7 @@
#include "executor/instrument.h"
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
+#include "utils/wait_event.h"
/*
* Asynchronously request a tuple from a designed async-capable node.
@@ -26,12 +27,17 @@
void
ExecAsyncRequest(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
if (areq->requestee->chgParam != NULL) /* something changed? */
ExecReScan(areq->requestee); /* let ReScan handle this */
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -47,6 +53,8 @@ ExecAsyncRequest(AsyncRequest *areq)
ExecAsyncResponse(areq);
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument,
TupIsNull(areq->result) ? 0.0 : 1.0);
@@ -62,9 +70,14 @@ ExecAsyncRequest(AsyncRequest *areq)
void
ExecAsyncConfigureWait(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -78,6 +91,8 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
}
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument, 0.0);
}
@@ -88,9 +103,14 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
void
ExecAsyncNotify(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -106,6 +126,8 @@ ExecAsyncNotify(AsyncRequest *areq)
ExecAsyncResponse(areq);
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument,
TupIsNull(areq->result) ? 0.0 : 1.0);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8213565a708..f77f539dd27 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -83,6 +83,12 @@ typedef struct FixedParallelExecutorState
int jit_flags;
} FixedParallelExecutorState;
+typedef struct SharedWaitEventUsageWorker
+{
+ int nentries;
+ dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
/*
* DSM structure for accumulating per-PlanState instrumentation.
*
@@ -92,6 +98,10 @@ typedef struct FixedParallelExecutorState
* of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
+ * wait_event_usage_offset: Offset, relative to the start of this structure,
+ * of the first SharedWaitEventUsageWorker object, or 0 if wait event usage is
+ * not being collected.
+ *
* num_workers: Number of workers.
*
* num_plan_nodes: Number of plan nodes.
@@ -103,6 +113,7 @@ struct SharedExecutorInstrumentation
{
int instrument_options;
int instrument_offset;
+ int wait_event_usage_offset;
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
@@ -110,17 +121,17 @@ struct SharedExecutorInstrumentation
/*
* Array of num_plan_nodes * num_workers NodeInstrumentation objects
* follows.
+ *
+ * If wait_event_usage_offset is non-zero, an array of num_plan_nodes *
+ * num_workers SharedWaitEventUsageWorker objects follows.
*/
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
-
-typedef struct SharedWaitEventUsageWorker
-{
- int nentries;
- dsa_pointer entries;
-} SharedWaitEventUsageWorker;
+#define GetInstrumentationWaitEventUsageArray(sei) \
+ (StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+ (SharedWaitEventUsageWorker *) (((char *) sei) + sei->wait_event_usage_offset))
struct SharedWaitEventUsage
{
@@ -143,6 +154,12 @@ typedef struct ExecParallelInitializeDSMContext
int nnodes;
} ExecParallelInitializeDSMContext;
+typedef struct ExecParallelRetrieveInstrumentationContext
+{
+ SharedExecutorInstrumentation *instrumentation;
+ dsa_area *area;
+} ExecParallelRetrieveInstrumentationContext;
+
/* Helper functions that run in the parallel leader. */
static char *ExecSerializePlan(Plan *plan, EState *estate);
static bool ExecParallelEstimate(PlanState *planstate,
@@ -154,8 +171,11 @@ static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt,
static bool ExecParallelReInitializeDSM(PlanState *planstate,
ParallelContext *pcxt);
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
- SharedExecutorInstrumentation *instrumentation);
+ ExecParallelRetrieveInstrumentationContext *r);
static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+ dsa_area *area,
+ const WaitEventUsage *usage);
static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
dsa_area *area,
const WaitEventUsage *usage);
@@ -691,6 +711,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int instrumentation_len = 0;
int jit_instrumentation_len = 0;
int instrument_offset = 0;
+ int wait_event_usage_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
int query_len;
@@ -798,6 +819,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len +=
mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
+ if (estate->es_instrument & INSTRUMENT_WAITS)
+ {
+ instrumentation_len = MAXALIGN(instrumentation_len);
+ wait_event_usage_offset = instrumentation_len;
+ instrumentation_len +=
+ mul_size(sizeof(SharedWaitEventUsageWorker),
+ mul_size(e.nnodes, nworkers));
+ }
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -903,11 +932,23 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
instrumentation->instrument_options = estate->es_instrument;
instrumentation->instrument_offset = instrument_offset;
+ instrumentation->wait_event_usage_offset = wait_event_usage_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument, false);
+ if (wait_event_usage_offset != 0)
+ {
+ SharedWaitEventUsageWorker *worker_usage;
+
+ worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ for (i = 0; i < nworkers * e.nnodes; ++i)
+ {
+ worker_usage[i].nentries = 0;
+ worker_usage[i].entries = InvalidDsaPointer;
+ }
+ }
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1137,9 +1178,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
*/
static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
- SharedExecutorInstrumentation *instrumentation)
+ ExecParallelRetrieveInstrumentationContext *r)
{
+ SharedExecutorInstrumentation *instrumentation = r->instrumentation;
NodeInstrumentation *instrument;
+ SharedWaitEventUsageWorker *wait_event_usage = NULL;
int i;
int n;
int ibytes;
@@ -1159,6 +1202,30 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
for (n = 0; n < instrumentation->num_workers; ++n)
InstrAggNode(planstate->instrument, &instrument[n]);
+ /* Accumulate the wait event usage from all workers. */
+ if (instrumentation->wait_event_usage_offset != 0 &&
+ planstate->wait_event_usage != NULL)
+ {
+ wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ wait_event_usage += i * instrumentation->num_workers;
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ {
+ SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
+ WaitEventUsageEntry *entries;
+
+ if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+ continue;
+
+ entries = dsa_get_address(r->area, worker->entries);
+ pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
+ entries,
+ worker->nentries);
+ dsa_free(r->area, worker->entries);
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ }
+ }
+
/*
* Also store the per-worker detail.
*
@@ -1216,7 +1283,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
}
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
- instrumentation);
+ r);
}
/*
@@ -1290,25 +1357,21 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
}
static void
-ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
- dsa_area *area,
- const WaitEventUsage *usage)
+ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+ dsa_area *area,
+ const WaitEventUsage *usage)
{
- SharedWaitEventUsageWorker *worker;
WaitEventUsageEntry *entries;
dsa_pointer entries_dsa;
Size entries_size;
- Assert(shared != NULL);
+ Assert(worker != NULL);
Assert(area != NULL);
Assert(usage != NULL);
- Assert(IsParallelWorker());
- Assert(ParallelWorkerNumber < shared->num_workers);
if (usage->nentries <= 0)
return;
- worker = &shared->worker_usage[ParallelWorkerNumber];
entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
@@ -1320,6 +1383,20 @@ ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
worker->entries = entries_dsa;
}
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage)
+{
+ Assert(shared != NULL);
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber < shared->num_workers);
+
+ ExecParallelReportWaitEventUsageWorker(&shared->worker_usage[ParallelWorkerNumber],
+ area,
+ usage);
+}
+
/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer/WAL usage.
@@ -1385,8 +1462,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ {
+ ExecParallelRetrieveInstrumentationContext r;
+
+ r.instrumentation = pei->instrumentation;
+ r.area = pei->area;
+ ExecParallelRetrieveInstrumentation(pei->planstate, &r);
+ }
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1495,6 +1577,17 @@ ExecParallelReportInstrumentation(PlanState *planstate,
Assert(IsParallelWorker());
Assert(ParallelWorkerNumber < instrumentation->num_workers);
InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
+ if (instrumentation->wait_event_usage_offset != 0 &&
+ planstate->wait_event_usage != NULL)
+ {
+ SharedWaitEventUsageWorker *wait_event_usage;
+
+ wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ wait_event_usage += i * instrumentation->num_workers;
+ ExecParallelReportWaitEventUsageWorker(&wait_event_usage[ParallelWorkerNumber],
+ planstate->state->es_query_dsa,
+ planstate->wait_event_usage);
+ }
return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
instrumentation);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7c4c66e323f..081855b3fed 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
#include "miscadmin.h"
+#include "utils/wait_event.h"
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
@@ -415,6 +416,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
if (estate->es_instrument)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ if (estate->es_instrument & INSTRUMENT_WAITS)
+ {
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+ result->wait_event_usage = palloc_object(WaitEventUsage);
+ pgstat_init_wait_event_usage(result->wait_event_usage,
+ estate->es_query_cxt);
+ MemoryContextSwitchTo(oldcontext);
+ }
return result;
}
@@ -481,7 +492,8 @@ ExecProcNodeFirst(PlanState *node)
* This has essentially the same responsibilities as ExecProcNode,
* but it does not do InstrStartNode/InstrStopNode (mainly because
* it can't tell how many returned tuples to count). Each per-node
- * function must provide its own instrumentation support.
+ * function must provide its own instrumentation support, including
+ * wait event attribution when enabled.
* ----------------------------------------------------------------
*/
Node *
@@ -758,6 +770,9 @@ ExecShutdownNode(PlanState *node)
static bool
ExecShutdownNode_walker(PlanState *node, void *context)
{
+ bool node_running;
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
if (node == NULL)
return false;
@@ -773,8 +788,14 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* has never been executed, so as to avoid incorrectly making it appear
* that it has.
*/
- if (node->instrument && node->instrument->running)
+ node_running = node->instrument && node->instrument->running;
+ if (node_running)
+ {
InstrStartNode(node->instrument);
+ if (node->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->wait_event_usage);
+ }
planstate_tree_walker(node, ExecShutdownNode_walker, context);
@@ -803,8 +824,12 @@ ExecShutdownNode_walker(PlanState *node, void *context)
}
/* Stop the node if we started it above, reporting 0 tuples. */
- if (node->instrument && node->instrument->running)
+ if (node_running)
+ {
+ if (node->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
InstrStopNode(node->instrument, 0);
+ }
return false;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ffbcd572133..a1f22180f82 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,7 @@
#include "nodes/execnodes.h"
#include "portability/instr_time.h"
#include "utils/guc_hooks.h"
+#include "utils/wait_event.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
@@ -181,11 +182,17 @@ TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
TupleTableSlot *result;
+ WaitEventUsage *previous_wait_event_usage = NULL;
InstrStartNode(node->instrument);
+ if (node->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->wait_event_usage);
result = node->ExecProcNodeReal(node);
+ if (node->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
return result;
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 9007dda3802..e2b442bd378 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -32,6 +32,7 @@
#include "executor/instrument.h"
#include "executor/nodeBitmapAnd.h"
#include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -115,10 +116,14 @@ MultiExecBitmapAnd(BitmapAndState *node)
int nplans;
int i;
TIDBitmap *result = NULL;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
/*
* get information from the node
@@ -162,6 +167,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
elog(ERROR, "BitmapAnd doesn't support zero inputs");
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1bc..c2798a81e97 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -28,6 +28,7 @@
#include "executor/nodeIndexscan.h"
#include "miscadmin.h"
#include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -54,10 +55,14 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
IndexScanDesc scandesc;
double nTuples = 0;
bool doscan;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ss.ps.instrument)
InstrStartNode(node->ss.ps.instrument);
+ if (node->ss.ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ss.ps.wait_event_usage);
/*
* extract necessary information from index scan node
@@ -116,6 +121,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
}
/* must provide our own instrumentation support */
+ if (node->ss.ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ss.ps.instrument)
InstrStopNode(node->ss.ps.instrument, nTuples);
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 148c80fdae6..e9ee934a6f6 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -33,6 +33,7 @@
#include "executor/nodeBitmapOr.h"
#include "nodes/tidbitmap.h"
#include "miscadmin.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -116,10 +117,14 @@ MultiExecBitmapOr(BitmapOrState *node)
int nplans;
int i;
TIDBitmap *result = NULL;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
/*
* get information from the node
@@ -180,6 +185,8 @@ MultiExecBitmapOr(BitmapOrState *node)
elog(ERROR, "BitmapOr doesn't support zero inputs");
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8825bb6fa23..6e9e4b4218f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -105,9 +105,14 @@ ExecHash(PlanState *pstate)
Node *
MultiExecHash(HashState *node)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
if (node->parallel_state != NULL)
MultiExecParallelHash(node);
@@ -115,6 +120,8 @@ MultiExecHash(HashState *node)
MultiExecPrivateHash(node);
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, node->hashtable->reportTuples);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index eb01bc3d88c..3d5a10b44c4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,6 +27,7 @@
#include "storage/shmem.h"
#include "storage/subsystems.h"
#include "storage/spin.h"
+#include "utils/memutils.h"
#include "utils/wait_event.h"
@@ -47,6 +48,7 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
int pgstat_wait_event_usage_depth = 0;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
static uint32 pgstat_wait_event_usage_current = 0;
static instr_time pgstat_wait_event_usage_start;
@@ -358,6 +360,28 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Initialize a wait event usage accumulator.
+ */
+void
+pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+ Assert(usage != NULL);
+ Assert(memcontext != NULL);
+
+ memset(usage, 0, sizeof(WaitEventUsage));
+
+ /*
+ * Wait events may end inside critical sections, for example while
+ * performing synchronous I/O. Keep usage entries in a dedicated context
+ * where the memory manager permits that accounting path to grow.
+ */
+ usage->memcontext = AllocSetContextCreate(memcontext,
+ "Wait Event Usage",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextAllowInCriticalSection(usage->memcontext, true);
+}
+
/*
* Start collecting exact wait event timings in this backend.
*
@@ -375,9 +399,9 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
if (pgstat_wait_event_usage_depth++ == 0)
{
- memset(usage, 0, sizeof(WaitEventUsage));
- usage->memcontext = memcontext;
+ pgstat_init_wait_event_usage(usage, memcontext);
pgstat_wait_event_usage = usage;
+ pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -398,11 +422,27 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
pgstat_count_wait_event_end();
pgstat_wait_event_usage = NULL;
+ pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
}
+WaitEventUsage *
+pgstat_enter_wait_event_usage(WaitEventUsage *usage)
+{
+ WaitEventUsage *previous = pgstat_wait_event_node_usage;
+
+ pgstat_wait_event_node_usage = usage;
+ return previous;
+}
+
+void
+pgstat_restore_wait_event_usage(WaitEventUsage *usage)
+{
+ pgstat_wait_event_node_usage = usage;
+}
+
/*
* Record the beginning of a wait event for exact EXPLAIN-style accounting.
*/
@@ -444,6 +484,11 @@ pgstat_count_wait_event_end(void)
pgstat_wait_event_usage_current,
1,
&elapsed);
+ if (pgstat_wait_event_node_usage != NULL)
+ WaitEventUsageAdd(pgstat_wait_event_node_usage,
+ pgstat_wait_event_usage_current,
+ 1,
+ &elapsed);
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f093a52aae0..f46ee03d2c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -65,6 +65,7 @@ typedef enum InstrumentOption
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
INSTRUMENT_IO = 1 << 4, /* needs IO usage */
+ INSTRUMENT_WAITS = 1 << 5, /* needs wait event usage */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 781c8c79132..56f9dcbef88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1212,6 +1212,7 @@ typedef struct PlanState
NodeInstrumentation *instrument; /* Optional runtime stats for this
* node */
+ WaitEventUsage *wait_event_usage; /* Optional wait event usage */
WorkerNodeInstrumentation *worker_instrument; /* per-worker
* instrumentation */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 63992137ee7..64411288403 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -36,12 +36,16 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
+ MemoryContext memcontext);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
const WaitEventUsageEntry *entries,
int nentries);
+extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index e3847e222be..c086fa48d80 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -101,12 +101,14 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
- explain_filter
----------------------------------------
+ explain_filter
+-----------------------------------------
Result (actual rows=N.N loops=N)
+ Wait Events:
+ Timeout:PgSleep calls=N time=N.N ms
Wait Events:
Timeout:PgSleep calls=N time=N.N ms
-(3 rows)
+(5 rows)
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
?column?
@@ -114,6 +116,12 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -130,6 +138,44 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Plan}',
+ '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+ language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 8821250bcef..bd196f248db 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,7 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -82,6 +83,34 @@ select jsonb_path_query_first(
from tenk1 where unique1 = 1') #> '{0,Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Plan}',
+ '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+ language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
rollback;
explain (waits) select 1;
--
2.52.0
---
doc/src/sgml/ref/explain.sgml | 52 ++++++-
src/backend/commands/explain.c | 82 +++++++----
src/backend/executor/execParallel.c | 121 +++++++++++-----
src/backend/utils/activity/wait_event.c | 184 +++++++++++++++++++-----
src/bin/psql/tab-complete.in.c | 4 +-
src/include/utils/wait_event.h | 5 +
src/test/regress/expected/explain.out | 128 ++++++++++++++++-
src/test/regress/sql/explain.sql | 102 ++++++++++++-
8 files changed, 574 insertions(+), 104 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5460568af1d..d699b215120 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -315,9 +315,46 @@ ROLLBACK;
<term><literal>WAITS</literal></term>
<listitem>
<para>
- Include exact backend-local wait event timing observed during query
- execution. Wait events are grouped by wait event type and wait event
- name, and include the number of waits and total time spent waiting.
+ Include wait event timing observed during query execution. Wait
+ events are grouped by wait event type and wait event name, using the
+ same wait event identifiers reported by
+ <structname>pg_stat_activity</structname>; see
+ <xref linkend="wait-event-table"/>. Each entry includes the number
+ of completed waits and the total time spent waiting. Wait timing is
+ collected when this option is enabled even if <literal>TIMING</literal>
+ is disabled.
+ </para>
+
+ <para>
+ If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
+ event storage without risking an error while a wait is ending, waits
+ whose exact event identifier could not be stored are accumulated in an
+ <literal>Unrecorded Wait Event Calls</literal> counter and
+ <literal>Unrecorded Wait Event Time</literal> total. This is a
+ reporting fallback under memory pressure, not a wait event emitted by
+ server instrumentation.
+ </para>
+
+ <para>
+ The top-level <literal>Statement Wait Events</literal> summary counts
+ each wait once for the statement, including waits reported by parallel
+ workers. Nested <command>EXPLAIN (ANALYZE, WAITS)</command> commands
+ maintain separate statement-level summaries; a wait is counted once in
+ each active summary. Plan nodes can also show <literal>Wait
+ Events</literal>. Node-level wait events are inclusive, like plan node
+ timing in
+ <command>EXPLAIN ANALYZE</command>: a wait is attributed to every
+ active plan node when the wait begins. Therefore node-level wait event
+ times should not be summed to compute the statement total.
+ </para>
+
+ <para>
+ The overhead of this option depends on the number of waits observed and,
+ for node-level output, on the number of active plan nodes when each wait
+ begins.
+ </para>
+
+ <para>
This parameter may only be used when <literal>ANALYZE</literal> is also
enabled. It defaults to <literal>FALSE</literal>.
</para>
@@ -405,6 +442,15 @@ ROLLBACK;
execution, and on machines that have relatively slow operating
system calls for obtaining the time of day.
</para>
+
+ <para>
+ The <literal>WAITS</literal> option also adds profiling overhead when
+ <literal>ANALYZE</literal> is used. It measures wait intervals reported
+ by server instrumentation and maintains per-statement and per-node wait
+ event counters while the statement executes. The overhead depends on the
+ number of reported waits, the number of distinct wait events, and the depth
+ of the active executor plan node stack.
+ </para>
</refsect1>
<refsect1>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0575ae8368b..9c198f8e599 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,7 +151,7 @@ static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static int wait_event_usage_cmp(const void *a, const void *b);
-static void show_wait_event_usage(ExplainState *es,
+static void show_wait_event_usage(ExplainState *es, const char *labelname,
const WaitEventUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -219,7 +219,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
* In the case of an INSTEAD NOTHING, tell at least that. But in
* non-text format, the output is delimited, so this isn't necessary.
*/
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
appendStringInfoString(es->str, "Query rewrites to nothing\n");
}
else
@@ -634,7 +634,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
ExplainPrintPlan(es, queryDesc);
if (waitEventUsagePtr)
- show_wait_event_usage(es, waitEventUsagePtr);
+ show_wait_event_usage(es, "Statement Wait Events", waitEventUsagePtr);
/* Show buffer and/or memory usage in planning */
if (peek_buffer_usage(es, bufusage) || mem_counters)
@@ -2335,7 +2335,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
if (es->waits)
- show_wait_event_usage(es, planstate->wait_event_usage);
+ show_wait_event_usage(es, "Wait Events", planstate->wait_event_usage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4556,14 +4556,15 @@ wait_event_usage_cmp(const void *a, const void *b)
}
static void
-show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+show_wait_event_usage(ExplainState *es, const char *labelname,
+ const WaitEventUsage *usage)
{
WaitEventUsageEntry *entries;
if (usage == NULL)
return;
- if (usage->nentries == 0)
+ if (usage->nentries == 0 && usage->overflowed_calls == 0)
return;
if (usage->nentries > 0)
@@ -4577,10 +4578,10 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
else
entries = NULL;
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
{
ExplainIndentText(es);
- appendStringInfoString(es->str, "Wait Events:\n");
+ appendStringInfo(es->str, "%s:\n", labelname);
es->indent++;
for (int i = 0; i < usage->nentries; i++)
@@ -4599,35 +4600,60 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
INSTR_TIME_GET_MILLISEC(entries[i].time));
}
+ if (usage->overflowed_calls > 0)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
+ usage->overflowed_calls,
+ INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+ }
+
es->indent--;
}
else
{
- ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
-
- for (int i = 0; i < usage->nentries; i++)
+ if (usage->nentries > 0)
{
- const char *event_type;
- const char *event_name;
+ ExplainOpenGroup("Wait-Events", labelname, false, es);
- event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
- event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
- ExplainOpenGroup("Wait-Event", NULL, true, es);
- ExplainPropertyText("Wait Event Type",
- event_type ? event_type : "Unknown",
- es);
- ExplainPropertyText("Wait Event",
- event_name ? event_name : "unknown",
- es);
- ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
- ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(entries[i].time),
- 3, es);
- ExplainCloseGroup("Wait-Event", NULL, true, es);
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainOpenGroup("Wait-Event", NULL, true, es);
+ ExplainPropertyText("Wait Event Type",
+ event_type ? event_type : "Unknown",
+ es);
+ ExplainPropertyText("Wait Event",
+ event_name ? event_name : "unknown",
+ es);
+ ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+ ExplainPropertyFloat("Time", "ms",
+ INSTR_TIME_GET_MILLISEC(entries[i].time),
+ 3, es);
+ ExplainCloseGroup("Wait-Event", NULL, true, es);
+ }
+
+ ExplainCloseGroup("Wait-Events", labelname, false, es);
}
- ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+ if (usage->overflowed_calls > 0)
+ {
+ /*
+ * This is not a wait event identity, so keep it outside the
+ * Wait Events array in structured output.
+ */
+ ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
+ usage->overflowed_calls, es);
+ ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
+ INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+ 3, es);
+ }
}
if (entries)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f77f539dd27..520b4b8484f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,6 +87,8 @@ typedef struct SharedWaitEventUsageWorker
{
int nentries;
dsa_pointer entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
} SharedWaitEventUsageWorker;
/*
@@ -173,6 +175,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
ExecParallelRetrieveInstrumentationContext *r);
static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker);
+static void ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+ SharedWaitEventUsageWorker *worker,
+ dsa_area *area);
static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
dsa_area *area,
const WaitEventUsage *usage);
@@ -904,10 +910,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
wait_event_usage->num_workers = nworkers;
for (int i = 0; i < nworkers; i++)
- {
- wait_event_usage->worker_usage[i].nentries = 0;
- wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
- }
+ ExecParallelInitWaitEventUsageWorker(&wait_event_usage->worker_usage[i]);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
wait_event_usage);
pei->wait_event_usage = wait_event_usage;
@@ -944,10 +947,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- {
- worker_usage[i].nentries = 0;
- worker_usage[i].entries = InvalidDsaPointer;
- }
+ ExecParallelInitWaitEventUsageWorker(&worker_usage[i]);
}
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
@@ -1209,21 +1209,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
wait_event_usage += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
- {
- SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
- WaitEventUsageEntry *entries;
-
- if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
- continue;
-
- entries = dsa_get_address(r->area, worker->entries);
- pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
- entries,
- worker->nentries);
- dsa_free(r->area, worker->entries);
- worker->nentries = 0;
- worker->entries = InvalidDsaPointer;
- }
+ ExecParallelAccumulateWaitEventUsageWorker(planstate->wait_event_usage,
+ &wait_event_usage[n],
+ r->area);
}
/*
@@ -1339,18 +1327,46 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
return;
for (int i = 0; i < shared->num_workers; i++)
+ ExecParallelAccumulateWaitEventUsageWorker(usage,
+ &shared->worker_usage[i],
+ pei->area);
+}
+
+static void
+ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker)
+{
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ worker->overflowed_calls = 0;
+ INSTR_TIME_SET_ZERO(worker->overflowed_time);
+}
+
+static void
+ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+ SharedWaitEventUsageWorker *worker,
+ dsa_area *area)
+{
+ Assert(usage != NULL);
+ Assert(worker != NULL);
+ Assert(area != NULL);
+
+ if (worker->overflowed_calls > 0)
{
- SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
- WaitEventUsageEntry *entries;
+ usage->overflowed_calls += worker->overflowed_calls;
+ INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+ worker->overflowed_calls = 0;
+ INSTR_TIME_SET_ZERO(worker->overflowed_time);
+ }
- if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
- continue;
+ if (worker->nentries > 0 && DsaPointerIsValid(worker->entries))
+ {
+ WaitEventUsageEntry *entries;
- entries = dsa_get_address(pei->area, worker->entries);
+ entries = dsa_get_address(area, worker->entries);
pgstat_accumulate_wait_event_usage(usage,
entries,
worker->nentries);
- dsa_free(pei->area, worker->entries);
+ dsa_free(area, worker->entries);
worker->nentries = 0;
worker->entries = InvalidDsaPointer;
}
@@ -1362,24 +1378,65 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
const WaitEventUsage *usage)
{
WaitEventUsageEntry *entries;
+ WaitEventUsageEntry *old_entries = NULL;
dsa_pointer entries_dsa;
Size entries_size;
+ int old_nentries = 0;
+ int new_nentries = 0;
+ int i = 0;
+ int j = 0;
Assert(worker != NULL);
Assert(area != NULL);
Assert(usage != NULL);
+ worker->overflowed_calls += usage->overflowed_calls;
+ INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+
if (usage->nentries <= 0)
return;
- entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+ if (DsaPointerIsValid(worker->entries))
+ {
+ Assert(worker->nentries > 0);
+ old_nentries = worker->nentries;
+ old_entries = dsa_get_address(area, worker->entries);
+ }
+
+ entries_size = mul_size(sizeof(WaitEventUsageEntry),
+ (Size) old_nentries + (Size) usage->nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
- memcpy(entries, usage->entries, entries_size);
+
+ while (i < old_nentries && j < usage->nentries)
+ {
+ WaitEventUsageEntry *entry = &entries[new_nentries];
+ uint32 old_info = old_entries[i].wait_event_info;
+ uint32 new_info = usage->entries[j].wait_event_info;
+
+ if (old_info < new_info)
+ *entry = old_entries[i++];
+ else if (old_info > new_info)
+ *entry = usage->entries[j++];
+ else
+ {
+ *entry = old_entries[i++];
+ entry->calls += usage->entries[j].calls;
+ INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+ j++;
+ }
+
+ new_nentries++;
+ }
+
+ while (i < old_nentries)
+ entries[new_nentries++] = old_entries[i++];
+ while (j < usage->nentries)
+ entries[new_nentries++] = usage->entries[j++];
if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
- worker->nentries = usage->nentries;
+ worker->nentries = new_nentries;
worker->entries = entries_dsa;
}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 3d5a10b44c4..61b418e8fd7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -39,6 +39,10 @@ static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed);
+static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+ const instr_time *elapsed);
+static int WaitEventUsageFind(const WaitEventUsage *usage,
+ uint32 wait_event_info, bool *found);
static uint32 local_my_wait_event_info;
@@ -48,7 +52,16 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
int pgstat_wait_event_usage_depth = 0;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+
+/*
+ * Top of the active executor node and query-level stacks. Query-level wait
+ * accounting records each wait once in every active query-level collector.
+ * Per-node wait accounting records each wait in every active plan node,
+ * matching the inclusive timing semantics of EXPLAIN ANALYZE plan nodes.
+ */
static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_node_stack = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_query_stack = NULL;
static uint32 pgstat_wait_event_usage_current = 0;
static instr_time pgstat_wait_event_usage_start;
@@ -387,24 +400,37 @@ pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
*
* This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
* It records waits observed through pgstat_report_wait_start/end in backend
- * local memory. Nested collection is deliberately treated as part of the
- * outer collection for now; callers that want independent nested accounting
- * need a stack of WaitEventUsage contexts.
+ * local memory. Nested top-level collectors are kept in a query-level stack;
+ * a wait is counted once in each active collector.
*/
void
pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
{
+ bool first;
+
Assert(usage != NULL);
Assert(memcontext != NULL);
- if (pgstat_wait_event_usage_depth++ == 0)
+ first = pgstat_wait_event_usage_depth == 0;
+ if (first)
{
- pgstat_init_wait_event_usage(usage, memcontext);
- pgstat_wait_event_usage = usage;
pgstat_wait_event_node_usage = NULL;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
+
+ pgstat_init_wait_event_usage(usage, memcontext);
+ usage->query_parent = pgstat_wait_event_usage;
+ /*
+ * A nested EXPLAIN can error out while one of its plan nodes is active,
+ * skipping the usual node-level restore. Remember the outer node stack so
+ * ending this collector can discard any leaked inner node frames.
+ */
+ usage->saved_node_usage = pgstat_wait_event_node_usage;
+ pgstat_wait_event_usage = usage;
+ pgstat_wait_event_usage_depth++;
}
/*
@@ -415,14 +441,22 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
{
Assert(usage != NULL);
Assert(pgstat_wait_event_usage_depth > 0);
+ Assert(pgstat_wait_event_usage == usage);
+
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage = usage->query_parent;
+ usage->query_parent = NULL;
+ pgstat_wait_event_node_usage = usage->saved_node_usage;
+ usage->saved_node_usage = NULL;
if (--pgstat_wait_event_usage_depth == 0)
{
- if (pgstat_wait_event_usage_current != 0)
- pgstat_count_wait_event_end();
-
pgstat_wait_event_usage = NULL;
pgstat_wait_event_node_usage = NULL;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -433,6 +467,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
{
WaitEventUsage *previous = pgstat_wait_event_node_usage;
+ Assert(usage != NULL);
+
+ usage->active_parent = previous;
pgstat_wait_event_node_usage = usage;
return previous;
}
@@ -440,6 +477,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
void
pgstat_restore_wait_event_usage(WaitEventUsage *usage)
{
+ Assert(pgstat_wait_event_node_usage == NULL ||
+ pgstat_wait_event_node_usage->active_parent == usage);
+
pgstat_wait_event_node_usage = usage;
}
@@ -460,6 +500,8 @@ pgstat_count_wait_event_start(uint32 wait_event_info)
pgstat_count_wait_event_end();
pgstat_wait_event_usage_current = wait_event_info;
+ pgstat_wait_event_usage_node_stack = pgstat_wait_event_node_usage;
+ pgstat_wait_event_usage_query_stack = pgstat_wait_event_usage;
INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
}
@@ -480,17 +522,24 @@ pgstat_count_wait_event_end(void)
elapsed = end;
INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
- WaitEventUsageAdd(pgstat_wait_event_usage,
- pgstat_wait_event_usage_current,
- 1,
- &elapsed);
- if (pgstat_wait_event_node_usage != NULL)
- WaitEventUsageAdd(pgstat_wait_event_node_usage,
+ for (WaitEventUsage *query_usage = pgstat_wait_event_usage_query_stack;
+ query_usage != NULL;
+ query_usage = query_usage->query_parent)
+ WaitEventUsageAdd(query_usage,
+ pgstat_wait_event_usage_current,
+ 1,
+ &elapsed);
+ for (WaitEventUsage *node_usage = pgstat_wait_event_usage_node_stack;
+ node_usage != NULL;
+ node_usage = node_usage->active_parent)
+ WaitEventUsageAdd(node_usage,
pgstat_wait_event_usage_current,
1,
&elapsed);
pgstat_wait_event_usage_current = 0;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -509,56 +558,123 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}
+/*
+ * Find the existing entry, or the insertion position for a new entry.
+ *
+ * WaitEventUsage entries are kept sorted by wait_event_info so the hot
+ * wait-end path does not need a linear scan through all distinct wait events
+ * already seen by the query or plan node.
+ */
+static int
+WaitEventUsageFind(const WaitEventUsage *usage, uint32 wait_event_info,
+ bool *found)
+{
+ int low = 0;
+ int high = usage->nentries;
+
+ while (low < high)
+ {
+ int mid = low + (high - low) / 2;
+ uint32 entry_info = usage->entries[mid].wait_event_info;
+
+ if (entry_info < wait_event_info)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *found = low < usage->nentries &&
+ usage->entries[low].wait_event_info == wait_event_info;
+ return low;
+}
+
static void
WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed)
{
- WaitEventUsageEntry *entry = NULL;
+ bool found;
+ int idx;
+ WaitEventUsageEntry *entry;
- for (int i = 0; i < usage->nentries; i++)
- {
- if (usage->entries[i].wait_event_info == wait_event_info)
- {
- entry = &usage->entries[i];
- break;
- }
- }
+ idx = WaitEventUsageFind(usage, wait_event_info, &found);
- if (entry == NULL)
+ if (!found)
{
if (usage->nentries >= usage->maxentries)
{
- MemoryContext oldcontext;
int newmaxentries;
+ Size entries_size;
+ WaitEventUsageEntry *newentries;
if (usage->maxentries > 0)
+ {
+ if ((Size) usage->maxentries >
+ MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+
newmaxentries = usage->maxentries * 2;
+ }
else
newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
- oldcontext = MemoryContextSwitchTo(usage->memcontext);
+ if ((Size) newmaxentries >
+ MaxAllocSize / sizeof(WaitEventUsageEntry))
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+
+ entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
+ /*
+ * Wait completion can happen in a critical section, so growth
+ * must not throw ERROR. If storage cannot be grown without
+ * throwing, preserve total wait time in the overflow bucket.
+ */
if (usage->entries)
- usage->entries = repalloc_array(usage->entries,
- WaitEventUsageEntry,
- newmaxentries);
+ newentries = repalloc_extended(usage->entries, entries_size,
+ MCXT_ALLOC_NO_OOM);
else
- usage->entries = palloc_array(WaitEventUsageEntry,
- newmaxentries);
- MemoryContextSwitchTo(oldcontext);
+ newentries = MemoryContextAllocExtended(usage->memcontext,
+ entries_size,
+ MCXT_ALLOC_NO_OOM);
+ if (newentries == NULL)
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+ usage->entries = newentries;
usage->maxentries = newmaxentries;
}
- entry = &usage->entries[usage->nentries++];
+ if (idx < usage->nentries)
+ memmove(&usage->entries[idx + 1], &usage->entries[idx],
+ sizeof(WaitEventUsageEntry) * (usage->nentries - idx));
+
+ entry = &usage->entries[idx];
+ usage->nentries++;
entry->wait_event_info = wait_event_info;
entry->calls = 0;
INSTR_TIME_SET_ZERO(entry->time);
}
+ else
+ entry = &usage->entries[idx];
entry->calls += calls;
INSTR_TIME_ADD(entry->time, *elapsed);
}
+static void
+WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+ const instr_time *elapsed)
+{
+ usage->overflowed_calls += calls;
+ INSTR_TIME_ADD(usage->overflowed_time, *elapsed);
+}
+
/* ----------
* pgstat_get_wait_event_type() -
*
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index db65d130fcb..b33490e7415 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4524,9 +4524,9 @@ match_previous_words(int pattern_id,
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
- "BUFFERS", "SERIALIZE", "WAL", "TIMING", "SUMMARY",
+ "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
"MEMORY", "FORMAT");
- else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|TIMING|SUMMARY|MEMORY"))
+ else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("SERIALIZE"))
COMPLETE_WITH("TEXT", "NONE", "BINARY");
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 64411288403..f14945cdb16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -25,9 +25,14 @@ typedef struct WaitEventUsageEntry
typedef struct WaitEventUsage
{
MemoryContext memcontext;
+ struct WaitEventUsage *active_parent; /* active plan-node stack link */
+ struct WaitEventUsage *query_parent; /* active query-level stack link */
+ struct WaitEventUsage *saved_node_usage; /* node stack at query start */
int nentries;
int maxentries;
WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
} WaitEventUsage;
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index c086fa48d80..275292d6172 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -47,6 +47,21 @@ begin
return data::jsonb;
end;
$$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
-- Disable JIT, or we'll get different output on machines where that's been
-- forced on
set jit = off;
@@ -106,11 +121,11 @@ select explain_filter('explain (analyze, waits, costs off, summary off, timing o
Result (actual rows=N.N loops=N)
Wait Events:
Timeout:PgSleep calls=N time=N.N ms
- Wait Events:
+ Statement Wait Events:
Timeout:PgSleep calls=N time=N.N ms
(5 rows)
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
?column?
----------------------------------------------------------------------------------
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
@@ -122,6 +137,52 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+begin;
+create function pg_temp.nested_explain_waits() returns void
+ language plpgsql as
+$$
+begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack. The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+ language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+ language plpgsql as
+$$
+begin
+ begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+ exception when division_by_zero then
+ null;
+ end;
+ perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -130,7 +191,7 @@ set local max_parallel_workers_per_gather = 1;
select jsonb_path_query_first(
explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
select pg_temp.parallel_pg_sleep(0.01)
- from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
jsonb_path_query_first
@@ -149,6 +210,53 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+rollback;
+begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic. The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan. The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+ language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+ select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from
+ (select count(unique1) from tenk1
+ where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+ right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+ select jsonb_path_query_first(plan,
+ '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+ from plan_json
+),
+pgsleep_wait as (
+ select node,
+ jsonb_path_query_first(node,
+ '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+ from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+ parallel rescan waits accumulated
+-----------------------------------
+ t
+(1 row)
+
rollback;
begin;
-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
@@ -157,6 +265,9 @@ begin;
-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
-- The planner GUCs below are likewise test-only scaffolding to make the
-- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
create function pg_temp.explain_waits_sleep_int(int) returns int
language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
create temp table explain_waits_bitmap (a int);
@@ -176,6 +287,17 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index bd196f248db..9f50ef189b0 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -51,6 +51,22 @@ begin
end;
$$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+
-- Disable JIT, or we'll get different output on machines where that's been
-- forced on
set jit = off;
@@ -70,9 +86,40 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
begin;
+create function pg_temp.nested_explain_waits() returns void
+ language plpgsql as
+$$
+begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack. The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+ language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+ language plpgsql as
+$$
+begin
+ begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+ exception when division_by_zero then
+ null;
+ end;
+ perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
set local debug_parallel_query = on;
@@ -80,7 +127,7 @@ set local max_parallel_workers_per_gather = 1;
select jsonb_path_query_first(
explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
select pg_temp.parallel_pg_sleep(0.01)
- from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
select jsonb_path_query_first(
@@ -91,12 +138,57 @@ select jsonb_path_query_first(
);
rollback;
begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic. The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan. The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+ language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+ select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from
+ (select count(unique1) from tenk1
+ where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+ right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+ select jsonb_path_query_first(plan,
+ '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+ from plan_json
+),
+pgsleep_wait as (
+ select node,
+ jsonb_path_query_first(node,
+ '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+ from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+rollback;
+begin;
-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
-- The planner GUCs below are likewise test-only scaffolding to make the
-- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
create function pg_temp.explain_waits_sleep_int(int) returns int
language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
create temp table explain_waits_bitmap (a int);
@@ -111,6 +203,12 @@ select jsonb_path_query_first(
where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
'$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
);
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
rollback;
explain (waits) select 1;
--
2.52.0
---
doc/src/sgml/ref/explain.sgml | 15 ++--
src/backend/commands/explain.c | 2 +-
src/backend/executor/execProcnode.c | 11 +--
src/backend/utils/activity/wait_event.c | 100 ++++++++++--------------
src/include/utils/wait_event.h | 10 +--
5 files changed, 56 insertions(+), 82 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index d699b215120..7fa4b1cd955 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -326,13 +326,14 @@ ROLLBACK;
</para>
<para>
- If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
- event storage without risking an error while a wait is ending, waits
- whose exact event identifier could not be stored are accumulated in an
- <literal>Unrecorded Wait Event Calls</literal> counter and
- <literal>Unrecorded Wait Event Time</literal> total. This is a
- reporting fallback under memory pressure, not a wait event emitted by
- server instrumentation.
+ Each statement and plan-node accumulator preallocates storage for up to
+ <literal>64</literal> distinct wait event identifiers. This bound
+ avoids memory allocation while a wait is ending. If more distinct wait
+ event identifiers are observed, waits whose exact event identifier could
+ not be stored are accumulated in an <literal>Unrecorded Wait Event
+ Calls</literal> counter and <literal>Unrecorded Wait Event Time</literal>
+ total. This is a reporting fallback, not a wait event emitted by server
+ instrumentation.
</para>
<para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9c198f8e599..ee69d723cd8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -4578,7 +4578,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
else
entries = NULL;
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
{
ExplainIndentText(es);
appendStringInfo(es->str, "%s:\n", labelname);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 081855b3fed..deee14839f2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -417,15 +417,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
if (estate->es_instrument & INSTRUMENT_WAITS)
- {
- MemoryContext oldcontext;
-
- oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- result->wait_event_usage = palloc_object(WaitEventUsage);
- pgstat_init_wait_event_usage(result->wait_event_usage,
- estate->es_query_cxt);
- MemoryContextSwitchTo(oldcontext);
- }
+ result->wait_event_usage =
+ pgstat_create_wait_event_usage(estate->es_query_cxt);
return result;
}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 61b418e8fd7..67980cc0a3b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,7 +27,6 @@
#include "storage/shmem.h"
#include "storage/subsystems.h"
#include "storage/spin.h"
-#include "utils/memutils.h"
#include "utils/wait_event.h"
@@ -43,15 +42,25 @@ static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
const instr_time *elapsed);
static int WaitEventUsageFind(const WaitEventUsage *usage,
uint32 wait_event_info, bool *found);
+static void WaitEventUsageInit(WaitEventUsage *usage,
+ MemoryContext memcontext);
static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;
-#define WAIT_EVENT_USAGE_INITIAL_EVENTS 16
+/*
+ * Hardcoded limit: each EXPLAIN WAITS statement-level or plan-node accumulator
+ * can record this many distinct wait event identities without allocating while
+ * waits are ending. Additional distinct wait identities are accounted for in
+ * the overflow bucket.
+ */
+#define WAIT_EVENT_USAGE_MAX_EVENTS 64
-int pgstat_wait_event_usage_depth = 0;
+/* Fast-path flag exported for inline pgstat_report_wait_start/end(). */
+bool pgstat_wait_event_usage_active = false;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static int pgstat_wait_event_usage_depth = 0;
/*
* Top of the active executor node and query-level stacks. Query-level wait
@@ -373,26 +382,36 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Allocate and initialize a wait event usage accumulator.
+ */
+WaitEventUsage *
+pgstat_create_wait_event_usage(MemoryContext memcontext)
+{
+ WaitEventUsage *usage;
+
+ Assert(memcontext != NULL);
+
+ usage = MemoryContextAlloc(memcontext, sizeof(WaitEventUsage));
+ WaitEventUsageInit(usage, memcontext);
+ return usage;
+}
+
/*
* Initialize a wait event usage accumulator.
*/
-void
-pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+static void
+WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
{
Assert(usage != NULL);
Assert(memcontext != NULL);
memset(usage, 0, sizeof(WaitEventUsage));
- /*
- * Wait events may end inside critical sections, for example while
- * performing synchronous I/O. Keep usage entries in a dedicated context
- * where the memory manager permits that accounting path to grow.
- */
- usage->memcontext = AllocSetContextCreate(memcontext,
- "Wait Event Usage",
- ALLOCSET_SMALL_SIZES);
- MemoryContextAllowInCriticalSection(usage->memcontext, true);
+ usage->entries = MemoryContextAlloc(memcontext,
+ sizeof(WaitEventUsageEntry) *
+ WAIT_EVENT_USAGE_MAX_EVENTS);
+ usage->maxentries = WAIT_EVENT_USAGE_MAX_EVENTS;
}
/*
@@ -421,7 +440,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
- pgstat_init_wait_event_usage(usage, memcontext);
+ WaitEventUsageInit(usage, memcontext);
usage->query_parent = pgstat_wait_event_usage;
/*
* A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -431,6 +450,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
usage->saved_node_usage = pgstat_wait_event_node_usage;
pgstat_wait_event_usage = usage;
pgstat_wait_event_usage_depth++;
+ pgstat_wait_event_usage_active = true;
}
/*
@@ -453,6 +473,7 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
if (--pgstat_wait_event_usage_depth == 0)
{
+ pgstat_wait_event_usage_active = false;
pgstat_wait_event_usage = NULL;
pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_node_stack = NULL;
@@ -602,52 +623,13 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
{
if (usage->nentries >= usage->maxentries)
{
- int newmaxentries;
- Size entries_size;
- WaitEventUsageEntry *newentries;
-
- if (usage->maxentries > 0)
- {
- if ((Size) usage->maxentries >
- MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- newmaxentries = usage->maxentries * 2;
- }
- else
- newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
-
- if ((Size) newmaxentries >
- MaxAllocSize / sizeof(WaitEventUsageEntry))
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
/*
- * Wait completion can happen in a critical section, so growth
- * must not throw ERROR. If storage cannot be grown without
- * throwing, preserve total wait time in the overflow bucket.
+ * Wait-end accounting must not allocate: it can run in a critical
+ * section. Preserve total calls/time without the exact event
+ * identity once preallocated storage is full.
*/
- if (usage->entries)
- newentries = repalloc_extended(usage->entries, entries_size,
- MCXT_ALLOC_NO_OOM);
- else
- newentries = MemoryContextAllocExtended(usage->memcontext,
- entries_size,
- MCXT_ALLOC_NO_OOM);
- if (newentries == NULL)
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- usage->entries = newentries;
- usage->maxentries = newmaxentries;
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
}
if (idx < usage->nentries)
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f14945cdb16..67497790307 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -24,7 +24,6 @@ typedef struct WaitEventUsageEntry
typedef struct WaitEventUsage
{
- MemoryContext memcontext;
struct WaitEventUsage *active_parent; /* active plan-node stack link */
struct WaitEventUsage *query_parent; /* active query-level stack link */
struct WaitEventUsage *saved_node_usage; /* node stack at query start */
@@ -41,8 +40,7 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
-extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
- MemoryContext memcontext);
+extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
@@ -55,7 +53,7 @@ extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
extern PGDLLIMPORT uint32 *my_wait_event_info;
-extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
+extern PGDLLIMPORT bool pgstat_wait_event_usage_active;
/*
@@ -101,7 +99,7 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
- if (pgstat_wait_event_usage_depth > 0)
+ if (unlikely(pgstat_wait_event_usage_active))
pgstat_count_wait_event_start(wait_event_info);
/*
@@ -120,7 +118,7 @@ pgstat_report_wait_start(uint32 wait_event_info)
static inline void
pgstat_report_wait_end(void)
{
- if (pgstat_wait_event_usage_depth > 0)
+ if (unlikely(pgstat_wait_event_usage_active))
pgstat_count_wait_event_end();
/* see pgstat_report_wait_start() */
--
2.52.0
---
src/backend/commands/explain.c | 45 ++++++++++--------
src/backend/executor/execParallel.c | 44 +++++++++--------
src/backend/utils/activity/wait_event.c | 63 +++++++++++++++++++++++--
src/include/utils/wait_event.h | 32 ++++++++-----
4 files changed, 129 insertions(+), 55 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee69d723cd8..0e2ec510fee 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -514,7 +514,6 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
- WaitEventUsage waitEventUsage;
WaitEventUsage *waitEventUsagePtr = NULL;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -593,9 +592,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
if (es->waits)
{
- waitEventUsagePtr = &waitEventUsage;
- pgstat_begin_wait_event_usage(waitEventUsagePtr,
- queryDesc->estate->es_query_cxt);
+ waitEventUsagePtr =
+ pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
}
@@ -4559,20 +4557,29 @@ static void
show_wait_event_usage(ExplainState *es, const char *labelname,
const WaitEventUsage *usage)
{
+ const WaitEventUsageEntry *usage_entries;
WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
+ int nentries;
if (usage == NULL)
return;
- if (usage->nentries == 0 && usage->overflowed_calls == 0)
+ if (pgstat_wait_event_usage_is_empty(usage))
return;
- if (usage->nentries > 0)
+ nentries = pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+ pgstat_get_wait_event_usage_overflow(usage,
+ &overflowed_calls,
+ &overflowed_time);
+
+ if (nentries > 0)
{
- entries = palloc_array(WaitEventUsageEntry, usage->nentries);
- memcpy(entries, usage->entries,
- sizeof(WaitEventUsageEntry) * usage->nentries);
- qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+ entries = palloc_array(WaitEventUsageEntry, nentries);
+ memcpy(entries, usage_entries,
+ sizeof(WaitEventUsageEntry) * nentries);
+ qsort(entries, nentries, sizeof(WaitEventUsageEntry),
wait_event_usage_cmp);
}
else
@@ -4584,7 +4591,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
appendStringInfo(es->str, "%s:\n", labelname);
es->indent++;
- for (int i = 0; i < usage->nentries; i++)
+ for (int i = 0; i < nentries; i++)
{
const char *event_type;
const char *event_name;
@@ -4600,24 +4607,24 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
INSTR_TIME_GET_MILLISEC(entries[i].time));
}
- if (usage->overflowed_calls > 0)
+ if (overflowed_calls > 0)
{
ExplainIndentText(es);
appendStringInfo(es->str,
"Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
- usage->overflowed_calls,
- INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+ overflowed_calls,
+ INSTR_TIME_GET_MILLISEC(overflowed_time));
}
es->indent--;
}
else
{
- if (usage->nentries > 0)
+ if (nentries > 0)
{
ExplainOpenGroup("Wait-Events", labelname, false, es);
- for (int i = 0; i < usage->nentries; i++)
+ for (int i = 0; i < nentries; i++)
{
const char *event_type;
const char *event_name;
@@ -4642,16 +4649,16 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
ExplainCloseGroup("Wait-Events", labelname, false, es);
}
- if (usage->overflowed_calls > 0)
+ if (overflowed_calls > 0)
{
/*
* This is not a wait event identity, so keep it outside the
* Wait Events array in structured output.
*/
ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
- usage->overflowed_calls, es);
+ overflowed_calls, es);
ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
- INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+ INSTR_TIME_GET_MILLISEC(overflowed_time),
3, es);
}
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 520b4b8484f..dcd06c718c8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1352,8 +1352,9 @@ ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
if (worker->overflowed_calls > 0)
{
- usage->overflowed_calls += worker->overflowed_calls;
- INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+ pgstat_accumulate_wait_event_usage_overflow(usage,
+ worker->overflowed_calls,
+ &worker->overflowed_time);
worker->overflowed_calls = 0;
INSTR_TIME_SET_ZERO(worker->overflowed_time);
}
@@ -1377,11 +1378,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
dsa_area *area,
const WaitEventUsage *usage)
{
+ const WaitEventUsageEntry *usage_entries;
WaitEventUsageEntry *entries;
WaitEventUsageEntry *old_entries = NULL;
dsa_pointer entries_dsa;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
Size entries_size;
int old_nentries = 0;
+ int usage_nentries;
int new_nentries = 0;
int i = 0;
int j = 0;
@@ -1390,10 +1395,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
Assert(area != NULL);
Assert(usage != NULL);
- worker->overflowed_calls += usage->overflowed_calls;
- INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+ usage_nentries =
+ pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+ pgstat_get_wait_event_usage_overflow(usage,
+ &overflowed_calls,
+ &overflowed_time);
+ worker->overflowed_calls += overflowed_calls;
+ INSTR_TIME_ADD(worker->overflowed_time, overflowed_time);
- if (usage->nentries <= 0)
+ if (usage_nentries <= 0)
return;
if (DsaPointerIsValid(worker->entries))
@@ -1404,25 +1414,25 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
}
entries_size = mul_size(sizeof(WaitEventUsageEntry),
- (Size) old_nentries + (Size) usage->nentries);
+ (Size) old_nentries + (Size) usage_nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
- while (i < old_nentries && j < usage->nentries)
+ while (i < old_nentries && j < usage_nentries)
{
WaitEventUsageEntry *entry = &entries[new_nentries];
uint32 old_info = old_entries[i].wait_event_info;
- uint32 new_info = usage->entries[j].wait_event_info;
+ uint32 new_info = usage_entries[j].wait_event_info;
if (old_info < new_info)
*entry = old_entries[i++];
else if (old_info > new_info)
- *entry = usage->entries[j++];
+ *entry = usage_entries[j++];
else
{
*entry = old_entries[i++];
- entry->calls += usage->entries[j].calls;
- INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+ entry->calls += usage_entries[j].calls;
+ INSTR_TIME_ADD(entry->time, usage_entries[j].time);
j++;
}
@@ -1431,8 +1441,8 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
while (i < old_nentries)
entries[new_nentries++] = old_entries[i++];
- while (j < usage->nentries)
- entries[new_nentries++] = usage->entries[j++];
+ while (j < usage_nentries)
+ entries[new_nentries++] = usage_entries[j++];
if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
@@ -1781,7 +1791,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
SharedJitInstrumentation *jit_instrumentation;
- WaitEventUsage waitEventUsage;
WaitEventUsage *waitEventUsagePtr = NULL;
int instrument_options = 0;
void *area_space;
@@ -1841,11 +1850,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
InstrStartParallelQuery();
if (wait_event_usage != NULL)
- {
- waitEventUsagePtr = &waitEventUsage;
- pgstat_begin_wait_event_usage(waitEventUsagePtr,
- queryDesc->estate->es_query_cxt);
- }
+ waitEventUsagePtr =
+ pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 67980cc0a3b..9719e38729e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,6 +36,17 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
+struct WaitEventUsage
+{
+ struct WaitEventUsage *active_parent; /* active plan-node stack link */
+ struct WaitEventUsage *query_parent; /* active query-level stack link */
+ struct WaitEventUsage *saved_node_usage; /* node stack at query start */
+ int nentries;
+ int maxentries;
+ WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
+};
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed);
static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
@@ -422,12 +433,12 @@ WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
* local memory. Nested top-level collectors are kept in a query-level stack;
* a wait is counted once in each active collector.
*/
-void
-pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+WaitEventUsage *
+pgstat_begin_wait_event_usage(MemoryContext memcontext)
{
+ WaitEventUsage *usage;
bool first;
- Assert(usage != NULL);
Assert(memcontext != NULL);
first = pgstat_wait_event_usage_depth == 0;
@@ -440,7 +451,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
- WaitEventUsageInit(usage, memcontext);
+ usage = pgstat_create_wait_event_usage(memcontext);
usage->query_parent = pgstat_wait_event_usage;
/*
* A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -451,6 +462,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
pgstat_wait_event_usage = usage;
pgstat_wait_event_usage_depth++;
pgstat_wait_event_usage_active = true;
+ return usage;
}
/*
@@ -579,6 +591,49 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}
+void
+pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+ uint64 calls,
+ const instr_time *elapsed)
+{
+ Assert(usage != NULL);
+ Assert(elapsed != NULL);
+
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+}
+
+bool
+pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage)
+{
+ Assert(usage != NULL);
+
+ return usage->nentries == 0 && usage->overflowed_calls == 0;
+}
+
+int
+pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+ const WaitEventUsageEntry **entries)
+{
+ Assert(usage != NULL);
+ Assert(entries != NULL);
+
+ *entries = usage->entries;
+ return usage->nentries;
+}
+
+void
+pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+ uint64 *calls,
+ instr_time *elapsed)
+{
+ Assert(usage != NULL);
+ Assert(calls != NULL);
+ Assert(elapsed != NULL);
+
+ *calls = usage->overflowed_calls;
+ *elapsed = usage->overflowed_time;
+}
+
/*
* Find the existing entry, or the insertion position for a new entry.
*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 67497790307..19763cfcae5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -15,6 +15,12 @@
#include "utils/palloc.h"
#include "utils/wait_event_types.h"
+/*
+ * EXPLAIN wait event accounting support. WaitEventUsage is intentionally
+ * opaque outside wait_event.c; callers should allocate, accumulate, and read
+ * it through the functions below. WaitEventUsageEntry is the reportable
+ * tuple copied to EXPLAIN output and parallel-worker storage.
+ */
typedef struct WaitEventUsageEntry
{
uint32 wait_event_info;
@@ -22,17 +28,7 @@ typedef struct WaitEventUsageEntry
instr_time time;
} WaitEventUsageEntry;
-typedef struct WaitEventUsage
-{
- struct WaitEventUsage *active_parent; /* active plan-node stack link */
- struct WaitEventUsage *query_parent; /* active query-level stack link */
- struct WaitEventUsage *saved_node_usage; /* node stack at query start */
- int nentries;
- int maxentries;
- WaitEventUsageEntry *entries;
- uint64 overflowed_calls;
- instr_time overflowed_time;
-} WaitEventUsage;
+typedef struct WaitEventUsage WaitEventUsage;
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
@@ -40,13 +36,23 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+
+/* EXPLAIN wait event accounting. */
extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
-extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
- MemoryContext memcontext);
+extern WaitEventUsage *pgstat_begin_wait_event_usage(MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
const WaitEventUsageEntry *entries,
int nentries);
+extern void pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+ uint64 calls,
+ const instr_time *elapsed);
+extern bool pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage);
+extern int pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+ const WaitEventUsageEntry **entries);
+extern void pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+ uint64 *calls,
+ instr_time *elapsed);
extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
--
2.52.0
---
src/bin/psql/tab-complete.in.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b33490e7415..45712f4ac12 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4525,8 +4525,8 @@ match_previous_words(int pattern_id,
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
"BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
- "MEMORY", "FORMAT");
- else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
+ "MEMORY", "IO", "FORMAT");
+ else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY|IO"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("SERIALIZE"))
COMPLETE_WITH("TEXT", "NONE", "BINARY");
--
2.52.0
On Sat, May 09, 2026 at 04:22:30AM +0500, Ilmar Yunusov wrote:
This RFC prototype adds `EXPLAIN (ANALYZE, WAITS)`, which reports
completed wait intervals observed through `pgstat_report_wait_start/end()`.
The option is named `WAITS` in this RFC to match the short style of
`BUFFERS`, `WAL`, `IO`, and `MEMORY`. I am not attached to the exact name;
`WAIT_EVENTS` may be clearer but is more verbose.
Rather than sending one email for each patch sent, the usual practice
on the Postgres community lists to send a series of patches in a
single email.
--
Michael
Following Michael Paquier's note about PostgreSQL list practice, this v1
resends the same patch content as v0 as a single email with the patch series
attached.
No code changes from v0.
v0 thread:
/messages/by-id/cover.1778280923.git.tanswis42@gmail.com
This RFC prototype adds EXPLAIN (ANALYZE, WAITS), which reports completed
wait
intervals observed through pgstat_report_wait_start/end().
Statement-level wait events are counted once per active statement collector.
Plan-node wait events use inclusive attribution, similar to EXPLAIN ANALYZE
timing: a completed wait is attributed to every active plan node captured
when
the wait begins. Per-node wait times therefore must not be summed across
plan
nodes.
The patch series is split as follows:
1. Add statement-level EXPLAIN WAITS reporting.
2. Aggregate statement-level EXPLAIN WAITS from parallel workers.
3. Attribute EXPLAIN WAITS to plan nodes.
4. Refine attribution semantics, documentation, and regression coverage.
5. Harden accumulator handling with fixed-size, allocation-free storage.
6. Hide accumulator internals from public headers.
7. Keep psql EXPLAIN option completion current.
The main review questions are:
- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.
Regards,
Ilmar
Attachments:
0001-Add-EXPLAIN-WAITS-statement-reporting.patchapplication/octet-stream; name=0001-Add-EXPLAIN-WAITS-statement-reporting.patchDownload+349-4
0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchapplication/octet-stream; name=0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patchDownload+190-4
0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchapplication/octet-stream; name=0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patchDownload+333-28
0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchapplication/octet-stream; name=0004-Refine-EXPLAIN-WAITS-attribution-semantics.patchDownload+574-105
0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchapplication/octet-stream; name=0005-Harden-EXPLAIN-WAITS-accumulator-handling.patchDownload+56-83
0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchapplication/octet-stream; name=0006-Hide-EXPLAIN-WAITS-accumulator-internals.patchDownload+129-56
0007-Keep-EXPLAIN-option-completion-current.patchapplication/octet-stream; name=0007-Keep-EXPLAIN-option-completion-current.patchDownload+2-3
This v2 keeps the same RFC feature scope as v1 and changes only regression
test coverage/stability.
v1 thread:
/messages/by-id/CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com
v0 thread:
/messages/by-id/cover.1778280923.git.tanswis42@gmail.com
The CFBot FreeBSD run showed that the regression tests assumed a too narrow
statement-level wait list. EXPLAIN WAITS can validly observe additional
statement-level waits around the measured query, for example
parallel-executor
IPC waits or DSM allocation waits. Those are valid observed waits, not an
accounting bug.
Changes in v2:
1. Make the text-output test check for the required Wait Events and
Statement
Wait Events lines, instead of expecting the full statement-level wait
list
to contain only Timeout:PgSleep.
2. Make JSON tests find Timeout:PgSleep by JSONPath instead of assuming it
is
the first wait event array element.
3. Disable debug_parallel_query and default gather workers in the explain
regression test before serial EXPLAIN checks.
4. Disable debug_parallel_query and gather workers in the bitmap runtime-key
attribution test.
5. Remove the plain-regression assertion for rescanned parallel worker wait
aggregation for now. Worker availability and the exact parallel plan
shape
are not deterministic enough for that test under the parallel regression
harness. The accounting behavior is still implemented, but this specific
edge should come back as a more isolated test if we can make it
deterministic
enough for CFBot.
There are no accounting-code changes from v1.
The main RFC questions are unchanged:
- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.
Local verification:
make -C src/test/regress check TESTS=explain
All 245 tests passed.
git diff --check
passed.
Regards,
Ilmar