Proposal: In-flight explain logging
Hello hackers,
# MOTIVATION
My recent experiences with problematic queries in customers motivated
me to write this patch proposing a new feature to enhance visibility
on what active queries are doing.
PostgreSQL already offers 2 very powerful tools for query troubleshooting:
- EXPLAIN: gives us hints on potential bottlenecks in an execution plan.
- EXPLAIN ANALYZE: shows precisely where bottlenecks are, but the query
must finish.
In my humble opinion we are missing something in the middle. Having
visibility
over in-flight queries would provide more insights than a plain EXPLAIN
and would allow us to analyze super problematic queries that never finish
a EXPLAIN ANALYZE execution.
Considering that every active query has an execution plan, the new feature
can target not only controlled EXPLAIN statements but also any query in
progress. This allows us to identify if a slow active query is using a
different plan and why (for example, custom settings set a session level
that are currently only visible to the backend).
# PROPOSAL
The feature works similarly to the recently introduced
pg_log_backend_memory_contexts().
The patch adds function pg_log_backend_explain_plan(PID) to be executed as
superuser in a second backend to signal the target backend to print
execution
plan details in the log.
For regular queries (called without instrumentation) PG will log the plain
explain along with useful details like custom settings.
When targeting a query with instrumentation enabled PG will log the complete
EXPLAIN ANALYZE plan with current row count and, if enabled, timing for each
node. Considering that the query is in progress the output will include the
following per node:
- (never executed) for nodes that weren't touched yet (or
may never be).
- (in progress) for nodes currently being executed, ie,
InstrStartNode was called and clock is ticking there.
Parallel workers can be targeted too, where PG will log only the relevant
part
of the complete execution plan.
# DEMONSTRATION
a) Targeting a not valid PG process:
postgres=# select pg_log_backend_explain_plan(1);
WARNING: PID 1 is not a PostgreSQL server process
pg_log_backend_explain_plan
-----------------------------
f
(1 row)
b) Targeting a PG process not running a query:
postgres=# select pg_log_backend_explain_plan(24103);
pg_log_backend_explain_plan
-----------------------------
t
(1 row)
2023-12-02 16:30:19.979 UTC [24103] LOG: PID 24103 not executing a
statement with in-flight explain logging enabled
c) Targeting an active query without any instrumentation:
postgres=# select pg_log_backend_explain_plan(24103);
pg_log_backend_explain_plan
-----------------------------
t
(1 row)
2023-12-02 16:33:10.968 UTC [24103] LOG: logging explain plan of PID 24103
Query Text: select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Gather (cost=70894.63..202643.27 rows=1000000 width=20)
Workers Planned: 2
-> Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20)
Hash Cond: (a.c1 = e.c1)
-> Parallel Hash Join (cost=54466.62..77218.65 rows=416667
width=16)
Hash Cond: (a.c1 = c.c1)
-> Parallel Hash Join (cost=15428.00..29997.42 rows=416667
width=8)
Hash Cond: (b.c1 = a.c1)
-> Parallel Seq Scan on t1 b (cost=0.00..8591.67
rows=416667 width=4)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667
width=4)
-> Parallel Seq Scan on t2 a (cost=0.00..8591.67
rows=416667 width=4)
-> Parallel Hash (cost=32202.28..32202.28 rows=416667
width=8)
-> Parallel Hash Join (cost=15428.00..32202.28
rows=416667 width=8)
Hash Cond: (c.c1 = d.c1)
-> Parallel Seq Scan on t1 c (cost=0.00..8591.67
rows=416667 width=4)
-> Parallel Hash (cost=8591.67..8591.67
rows=416667 width=4)
-> Parallel Seq Scan on t1 d
(cost=0.00..8591.67 rows=416667 width=4)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4)
-> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667
width=4)
Settings: max_parallel_workers_per_gather = '4'
d) Targeting a parallel query (and its parallel workers) with
instrumentation:
postgres=# select pid, backend_type,pg_log_backend_explain_plan(pid)
postgres=# from pg_stat_activity
postgres=# where (backend_type = 'client backend' and pid !=
pg_backend_pid())
postgres=# or backend_type = 'parallel worker';
pid | backend_type | pg_log_backend_explain_plan
-------+-----------------+-----------------------------
24103 | client backend | t
24389 | parallel worker | t
24390 | parallel worker | t
(3 rows)
2023-12-02 16:36:34.840 UTC [24103] LOG: logging explain plan of PID 24103
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Gather (cost=70894.63..202643.27 rows=1000000 width=20) (never executed)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20)
(never executed)
Hash Cond: (a.c1 = e.c1)
-> Parallel Hash Join (cost=54466.62..77218.65 rows=416667
width=16) (never executed)
Hash Cond: (a.c1 = c.c1)
-> Parallel Hash Join (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
Hash Cond: (b.c1 = a.c1)
-> Parallel Seq Scan on t1 b (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
-> Parallel Seq Scan on t2 a (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=32202.28..32202.28 rows=416667
width=8) (never executed)
-> Parallel Hash Join (cost=15428.00..32202.28
rows=416667 width=8) (never executed)
Hash Cond: (c.c1 = d.c1)
-> Parallel Seq Scan on t1 c (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67
rows=416667 width=4) (never executed)
-> Parallel Seq Scan on t1 d
(cost=0.00..8591.67 rows=416667 width=4) (actual time=0.023..8.688
rows=107903 loops=1) (in progress)
Buffers: shared hit=46 read=432
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4)
(actual time=607.171..607.171 rows=341486 loops=1)
Buffers: shared hit=717 read=794, temp written=896
-> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.009..20.413 rows=341486 loops=1)
Buffers: shared hit=717 read=794
Settings: max_parallel_workers_per_gather = '4'
2023-12-02 16:36:34.841 UTC [24389] LOG: logging explain plan of PID 24389
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) (never
executed)
Hash Cond: (a.c1 = e.c1)
-> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16)
(never executed)
Hash Cond: (a.c1 = c.c1)
-> Parallel Hash Join (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
Hash Cond: (b.c1 = a.c1)
-> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667
width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4)
(never executed)
-> Parallel Seq Scan on t2 a (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8)
(never executed)
-> Parallel Hash Join (cost=15428.00..32202.28 rows=416667
width=8) (never executed)
Hash Cond: (c.c1 = d.c1)
-> Parallel Seq Scan on t1 c (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
-> Parallel Seq Scan on t1 d (cost=0.00..8591.67
rows=416667 width=4) (actual time=0.024..7.486 rows=99146 loops=1) (in
progress)
Buffers: shared hit=43 read=396
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (actual
time=595.768..595.768 rows=329056 loops=1)
Buffers: shared hit=752 read=704, temp written=868
-> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.003..20.849 rows=329056 loops=1)
Buffers: shared hit=752 read=704
Settings: max_parallel_workers_per_gather = '4'
2023-12-02 16:36:34.844 UTC [24390] LOG: logging explain plan of PID 24390
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) (never
executed)
Hash Cond: (a.c1 = e.c1)
-> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16)
(never executed)
Hash Cond: (a.c1 = c.c1)
-> Parallel Hash Join (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
Hash Cond: (b.c1 = a.c1)
-> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667
width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4)
(never executed)
-> Parallel Seq Scan on t2 a (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8)
(never executed)
-> Parallel Hash Join (cost=15428.00..32202.28 rows=416667
width=8) (never executed)
Hash Cond: (c.c1 = d.c1)
-> Parallel Seq Scan on t1 c (cost=0.00..8591.67
rows=416667 width=4) (never executed)
-> Parallel Hash (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
-> Parallel Seq Scan on t1 d (cost=0.00..8591.67
rows=416667 width=4) (actual time=0.005..7.186 rows=98901 loops=1) (in
progress)
Buffers: shared hit=11 read=427
-> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (actual
time=594.224..594.224 rows=329458 loops=1)
Buffers: shared hit=708 read=750, temp written=864
-> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.955..21.233 rows=329458 loops=1)
Buffers: shared hit=708 read=750
Settings: max_parallel_workers_per_gather = '4'
# IMPLEMENTATION DETAILS
- Process signaling
The whole process signaling implementation is identical to the logic done
for pg_log_backend_memory_contexts(). After signaling a process, the
ultimate
function called to perform the plan logging is
ProcessLogExplainPlanInterrupt()
in explain.c.
- How to track a currently running query?
Explain plans are printed via a QueryDesc structure so we need to be able
to access that object for the currently running query.
For a simple select query, where the QueryDesc is created here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L495
)
the QueryDesc is accessible via the global ActivePortal pointer as the
objects is stored here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L522
).
The problem is that for EXPLAIN commands the QueryDesc created here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L575
)
isn't accessible externally. It exists only in that code context.
So my solution was to have a global pointer in explain.c that is either
NULL or is pointed to the currently active QueryDesc. At the end of
standard_ExecutorStart() in execMain.c I call new function
ExplainTrackQuery(QueryDesc)
in explain.c that will take care of pointing the global pointer to the
QueryDesc instance.
This is an important part of the code. The overhead of the implementation
is that every query will do the new logic of assigning the global pointer
and making sure pointer is always valid (see next section).
- How to make sure the new global pointer is always valid?
The global pointer starts as NULL, gets assigned via
ExplainTrackQuery(QueryDesc)
and gets cleared with the help of a MemoryContextCallback.
The strategy there is that a MemoryContextCallback will be assigned
in the same MemoryContext where the tracked QueryDesc was created. When
the MemoryContext is gone (executor is complete) the QueryDesc instance
will be destroyed and function QueryDescReleaseFunc() in explain.c will
be called to clear the global pointer. With that we can make sure that
the pointer always get cleared, even if the query gets cancelled.
- Safely printing in-flight execution plans
A plan string is built in function ExplainNode here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L1178
)
which is called at the end of a query execution when EXPLAIN is used.
That function performs logic using a PlanState (part of QueryDesc) of
the running query and a ExplainState.
The main challenge there is that ExplainNode calls InstrEndLoop which
changes values in Instrumentation. This is ok for a regular EXPLAIN
where the query is already complete but not ok for the new feature with
in-flight explains.
So the new code has custom logic to clone Instrumentation instance of
the current node. The cloned object can be safely written.
Function InstrEndLoop has a safety rule here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/executor/instrument.c#L148
)
that prevents adjusting instrumentation details in a running node. This
never happens in the current code logic but with the new in-flight
explain it will happen very often.
I didn't want to remove this safety rule as InstrEndLoop gets called in
other places too (even in auto_explain) so the solution was to keep
InstrEndLoop and have a new InstrEndLoopForce for the in-flight explain.
Both InstrEndLoop and InstrEndLoopForce call a new internal
InstrEndLoopInternal to avoid duplicating the code.
- Memory management
Considering that pg_log_backend_explain_plan() can be called indefinite
times in the same query execution, all allocated objects in the new
implementation (via palloc) are manually deallocated. This avoids private
memory to keep growing until MemoryContext is released.
- ExplainState customization
A ExplainState is allocated and customized for the in-flight logging.
Instrumentation related settings are enabled based on how the target
query started, which is usually via EXPLAIN ANALYZE or with auto_explain.
es = NewExplainState();
es->in_flight = true;
es->analyze = currentQueryDesc->instrument_options;
es->buffers = (currentQueryDesc->instrument_options &
INSTRUMENT_BUFFERS) != 0;
es->wal = (currentQueryDesc->instrument_options &
INSTRUMENT_WAL) != 0;
es->timing = (currentQueryDesc->instrument_options &
INSTRUMENT_TIMER) != 0;
There are other settings that I currently selected some static values
for testing:
es->summary = (es->analyze);
es->format = EXPLAIN_FORMAT_TEXT;
es->verbose = false;
es->settings = true;
For those we can think about customizations like global settings or
passing through attributes in pg_log_backend_explain_plan(). There
is definitely room for improvement here.
- Implementation overhead
As mentioned earlier, the new feature adds overhead by having
to adjust the query desc global pointer in every QueryDesc that
passes through standard_ExecutorStart(). If we think this is not
a good idea we can think about moving ExplainTrackQuery(QueryDesc)
to specific QueryDesc allocations, like having the feature just
for EXPLAIN commands. But that would limit what we can inspect.
# FINAL CONSIDERATIONS
This should be enough for an initial proposal. Apologies for the huge
mail. This is my first patch so I am probably missing a lot of standards
and good practices the community is already familiar with.
Plus, I haven't implemented any tests yet. If you think it is worth
considering this new feature I can work on them.
Kind Regards,
Rafael Castro.
Attachments:
v1-0001-Proposal-for-in-flight-explain-logging.patchapplication/octet-stream; name=v1-0001-Proposal-for-in-flight-explain-logging.patchDownload
From 38b5c78cda545fd210438016fb9d1c8d57aba86c Mon Sep 17 00:00:00 2001
From: Rafael Castro <rafaelthca@gmail.com>
Date: Fri, 1 Dec 2023 17:51:57 -0300
Subject: [PATCH v1] Proposal for in-flight explain logging.
This proposal introduces function pg_log_backend_explain_plan(PID)
to signal a backend process to log the explain plan of an query
currently in process there.
For regular queries (called without instrumentation) PG will log
the plain explain along with useful details like custom settings.
When targeting a query with instrumentation enabled PG will log
the complete EXPLAIN ANALYZE plan with current row count for each
node, current per node timing, etc. Considering that this query is
in progress the output will include the following per node:
- (never executed) for nodes that weren't touched yet (or
may never be).
- (in progress) for nodes currently being executed, ie,
InstrStartNode was called and clock is ticking there.
Parallel workers can be targeted too, where PG will log only the
relevant part of the complete execution plan.
---
src/backend/commands/explain.c | 294 +++++++++++++++++++++++++--
src/backend/executor/execMain.c | 6 +
src/backend/executor/instrument.c | 20 +-
src/backend/storage/ipc/procsignal.c | 3 +
src/backend/tcop/postgres.c | 3 +
src/backend/utils/init/globals.c | 1 +
src/include/catalog/pg_proc.dat | 6 +
src/include/commands/explain.h | 6 +
src/include/executor/instrument.h | 1 +
src/include/miscadmin.h | 1 +
src/include/storage/procsignal.h | 1 +
11 files changed, 318 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f1d71bc54e..e2d3c4d597 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -39,6 +39,7 @@
#include "utils/tuplesort.h"
#include "utils/typcache.h"
#include "utils/xml.h"
+#include "storage/procarray.h"
/* Hook for plugins to get control in ExplainOneQuery() */
@@ -47,6 +48,9 @@ ExplainOneQuery_hook_type ExplainOneQuery_hook = NULL;
/* Hook for plugins to get control in explain_get_index_name() */
explain_get_index_name_hook_type explain_get_index_name_hook = NULL;
+/* Pointer to the currently running query descriptor for in-flight explain logging. */
+static QueryDesc *queryDescGlobal;
+
/* OR-able flags for ExplainXMLTag() */
#define X_OPENING 0
@@ -153,6 +157,7 @@ static void ExplainIndentText(ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
+static void QueryDescReleaseFunc(void *);
@@ -325,6 +330,8 @@ NewExplainState(void)
es->costs = true;
/* Prepare output buffer. */
es->str = makeStringInfo();
+ /* An explain state is not in-flight by default */
+ es->in_flight = false;
return es;
}
@@ -1195,6 +1202,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
const char *operation = NULL;
const char *custom_name = NULL;
ExplainWorkersState *save_workers_state = es->workers_state;
+ Instrumentation *local_instr = NULL;
int save_indent = es->indent;
bool haschildren;
@@ -1653,28 +1661,57 @@ ExplainNode(PlanState *planstate, List *ancestors,
* instrumentation results the user didn't ask for. But we do the
* InstrEndLoop call anyway, if possible, to reduce the number of cases
* auto_explain has to contend with.
+ *
+ * For regular explains instrumentation clean up is called directly in
+ * the main instrumentation objects. In-flight explains need to clone
+ * instrumentation object and forcibly end the loop in nodes that may
+ * be running.
*/
- if (planstate->instrument)
- InstrEndLoop(planstate->instrument);
+ if (planstate->instrument) {
+ /* In flight explain. Clone instrumentation */
+ if (es->in_flight) {
+ local_instr = palloc0(sizeof(*local_instr));
+ *local_instr = *planstate->instrument;
+ /* Force end loop even if node is in progress */
+ InstrEndLoopForce(local_instr);
+ }
+ /* Use main instrumentation */
+ else {
+ local_instr = planstate->instrument;
+ InstrEndLoop(local_instr);
+ }
+ }
if (es->analyze &&
- planstate->instrument && planstate->instrument->nloops > 0)
+ local_instr && local_instr->nloops > 0)
{
- double nloops = planstate->instrument->nloops;
- double startup_ms = 1000.0 * planstate->instrument->startup / nloops;
- double total_ms = 1000.0 * planstate->instrument->total / nloops;
- double rows = planstate->instrument->ntuples / nloops;
+ double nloops = local_instr->nloops;
+ double startup_ms = 1000.0 * local_instr->startup / nloops;
+ double total_ms = 1000.0 * local_instr->total / nloops;
+ double rows = local_instr->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
{
if (es->timing)
- appendStringInfo(es->str,
+ /* Node in progress */
+ if (!INSTR_TIME_IS_ZERO(planstate->instrument->starttime))
+ appendStringInfo(es->str,
+ " (actual time=%.3f..%.3f rows=%.0f loops=%.0f) (in progress)",
+ startup_ms, total_ms, rows, nloops);
+ else
+ appendStringInfo(es->str,
" (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
startup_ms, total_ms, rows, nloops);
else
- appendStringInfo(es->str,
- " (actual rows=%.0f loops=%.0f)",
- rows, nloops);
+ /* Node in progress */
+ if (!INSTR_TIME_IS_ZERO(planstate->instrument->starttime))
+ appendStringInfo(es->str,
+ " (actual rows=%.0f loops=%.0f) (in progress)",
+ rows, nloops);
+ else
+ appendStringInfo(es->str,
+ " (actual rows=%.0f loops=%.0f)",
+ rows, nloops);
}
else
{
@@ -1810,7 +1847,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
if (es->analyze)
ExplainPropertyFloat("Heap Fetches", NULL,
- planstate->instrument->ntuples2, 0, es);
+ local_instr->ntuples2, 0, es);
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2105,10 +2142,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
/* Show buffer/WAL usage */
- if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage, false);
- if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ if (es->buffers && local_instr)
+ show_buffer_usage(es, &local_instr->bufusage, false);
+ if (es->wal && local_instr)
+ show_wal_usage(es, &local_instr->walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2247,6 +2284,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainCloseGroup("Plan",
relationship ? NULL : "Plan",
true, es);
+
+ /* In flight explain. Free cloned instrumentation object */
+ if (es->in_flight && local_instr) {
+ pfree(local_instr);
+ }
}
/*
@@ -4089,11 +4131,22 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
double total;
double insert_path;
double other_path;
-
- InstrEndLoop(outerPlanState(mtstate)->instrument);
+ Instrumentation *local_instr;
+
+ /* In flight explain. Clone instrumentation */
+ if (es->in_flight) {
+ local_instr = palloc0(sizeof(*local_instr));
+ *local_instr = *outerPlanState(mtstate)->instrument;
+ /* Force end loop even if node is in progress */
+ InstrEndLoopForce(local_instr);
+ }
+ else {
+ local_instr = outerPlanState(mtstate)->instrument;
+ InstrEndLoop(local_instr);
+ }
/* count the number of source rows */
- total = outerPlanState(mtstate)->instrument->ntuples;
+ total = local_instr->ntuples;
other_path = mtstate->ps.instrument->ntuples2;
insert_path = total - other_path;
@@ -4101,6 +4154,10 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
insert_path, 0, es);
ExplainPropertyFloat("Conflicting Tuples", NULL,
other_path, 0, es);
+
+ /* In flight explain. Free cloned instrumentation object */
+ if (es->in_flight)
+ pfree(local_instr);
}
}
else if (node->operation == CMD_MERGE)
@@ -4113,11 +4170,22 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
double update_path;
double delete_path;
double skipped_path;
-
- InstrEndLoop(outerPlanState(mtstate)->instrument);
+ Instrumentation *local_instr;
+
+ /* In flight explain. Clone instrumentation */
+ if (es->in_flight) {
+ local_instr = palloc0(sizeof(*local_instr));
+ *local_instr = *outerPlanState(mtstate)->instrument;
+ /* Force end loop even if node is in progress */
+ InstrEndLoopForce(local_instr);
+ }
+ else {
+ local_instr = outerPlanState(mtstate)->instrument;
+ InstrEndLoop(local_instr);
+ }
/* count the number of source rows */
- total = outerPlanState(mtstate)->instrument->ntuples;
+ total = local_instr->ntuples;
insert_path = mtstate->mt_merge_inserted;
update_path = mtstate->mt_merge_updated;
delete_path = mtstate->mt_merge_deleted;
@@ -4148,6 +4216,10 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
ExplainPropertyFloat("Tuples Deleted", NULL, delete_path, 0, es);
ExplainPropertyFloat("Tuples Skipped", NULL, skipped_path, 0, es);
}
+
+ /* In flight explain. Free cloned instrumentation object */
+ if (es->in_flight)
+ pfree(local_instr);
}
}
@@ -5082,3 +5154,181 @@ escape_yaml(StringInfo buf, const char *str)
{
escape_json(buf, str);
}
+
+/*
+ * pg_log_backend_explain_plan
+ * Signal a backend or an auxiliary process to log its execution plan.
+ *
+ * By default, only superusers are allowed to signal to log the execution
+ * plans because allowing any users to issue this request at an unbounded
+ * rate would cause lots of log messages and which can lead to denial of
+ * service. Additional roles can be permitted with GRANT.
+ *
+ * On receipt of this signal, a backend or an auxiliary process sets the
+ * flag in the signal handler, which causes the next CHECK_FOR_INTERRUPTS()
+ * or process-specific interrupt handler to log the explain execution plan.
+ */
+Datum
+pg_log_backend_explain_plan(PG_FUNCTION_ARGS)
+{
+ int pid = PG_GETARG_INT32(0);
+ PGPROC *proc = BackendPidGetProc(pid);
+
+ /* Check if target is a valid PG process. */
+ if (proc == NULL)
+ {
+ /*
+ * This is just a warning so a loop-through-resultset will not abort
+ * if one backend terminated on its own during the run.
+ */
+ ereport(WARNING,
+ (errmsg("PID %d is not a PostgreSQL server process", pid)));
+ PG_RETURN_BOOL(false);
+ }
+
+ /* Only allow superusers to log explain execution plans. */
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be a superuser to log explain plans")));
+
+ if (SendProcSignal(pid, PROCSIG_LOG_EXPLAIN_PLAN, proc->backendId) < 0)
+ {
+ /* Again, just a warning to allow loops */
+ ereport(WARNING,
+ (errmsg("could not send signal to process %d: %m", pid)));
+ PG_RETURN_BOOL(false);
+ }
+
+ PG_RETURN_BOOL(true);
+}
+
+/*
+ * HandleLogExplainPlanInterrupt
+ * Handle receipt of an interrupt indicating logging of explain
+ * execution plan.
+ *
+ * All the actual work is deferred to ProcessLogExplainPlanInterrupt(),
+ * because we cannot safely emit a log message inside the signal handler.
+ */
+void
+HandleLogExplainPlanInterrupt(void)
+{
+ InterruptPending = true;
+ LogExplainPlanPending = true;
+ /* latch will be set by procsignal_sigusr1_handler */
+}
+
+/*
+ * ProcessLogExplainPlanInterrupt
+ * Perform logging of explain execution plan of this backend process.
+ *
+ * Any backend that participates in ProcSignal signaling must arrange
+ * to call this function if we see LogExplainPlanPending set.
+ * It is called from CHECK_FOR_INTERRUPTS(), which is enough because
+ * the target process for logging of explain plans is a backend.
+ */
+void
+ProcessLogExplainPlanInterrupt(void)
+{
+ ExplainState *es;
+ /* Signal is consumed */
+ LogExplainPlanPending = false;
+
+ /* Produce a plan only if query isn't finished yet */
+ if (queryDescGlobal &&
+ queryDescGlobal->planstate &&
+ !queryDescGlobal->planstate->state->es_finished)
+ {
+ QueryDesc *currentQueryDesc = queryDescGlobal;
+
+ es = NewExplainState();
+ es->in_flight = true;
+ es->analyze = currentQueryDesc->instrument_options;
+ es->buffers = (currentQueryDesc->instrument_options &
+ INSTRUMENT_BUFFERS) != 0;
+ es->wal = (currentQueryDesc->instrument_options &
+ INSTRUMENT_WAL) != 0;
+ es->timing = (currentQueryDesc->instrument_options &
+ INSTRUMENT_TIMER) != 0;
+
+ /*
+ * Customizable parameters. This could come from GUC variables
+ * or parameters of pg_log_backend_explain_plan() function.
+ */
+ es->summary = (es->analyze);
+ es->format = EXPLAIN_FORMAT_TEXT;
+ es->verbose = false;
+ es->settings = true;
+
+ ExplainBeginOutput(es);
+ ExplainQueryText(es, currentQueryDesc);
+ ExplainPrintPlan(es, currentQueryDesc);
+ ExplainEndOutput(es);
+
+ /*
+ * Use LOG_SERVER_ONLY to prevent this message from being sent
+ * to the connected client.
+ */
+ ereport(LOG_SERVER_ONLY,
+ (errhidestmt(true),
+ errhidecontext(true),
+ errmsg("logging explain plan of PID %d\n%s",
+ MyProcPid, es->str->data)));
+
+ /*
+ * Free local explain state before exiting as this function
+ * may be called multiple times in the same memory context.
+ */
+ pfree(es->str);
+ pfree(es);
+ }
+ else {
+ ereport(LOG_SERVER_ONLY,
+ (errhidestmt(true),
+ errhidecontext(true),
+ errmsg("PID %d not executing a statement with in-flight explain logging enabled",
+ MyProcPid)));
+ }
+}
+
+/*
+ * ExplainTrackQuery
+ * Enables in-flight explain logging for a query in the local backend.
+ *
+ * An in-flight explain logging is requested by another process via signal
+ * so we need to keep track of the target query descriptor in a global pointer.
+ *
+ * This pointer needs to be properly cleared when the query descriptor is
+ * gone, which is achieved with the help of a memory context callback configured
+ * in the same memory context where the query descriptor was created. This
+ * strategy allows clearing the pointer even when the query gets cancelled.
+ */
+void
+ExplainTrackQuery(QueryDesc *queryDesc)
+{
+ /*
+ * Don't track a query descriptor when there is another tracking
+ * in progress.
+ */
+ if (queryDescGlobal == NULL) {
+ MemoryContextCallback *queryDescReleaseCallback;
+ queryDescReleaseCallback = (MemoryContextCallback *)
+ palloc0(sizeof(MemoryContextCallback));
+ queryDescReleaseCallback->func = QueryDescReleaseFunc;
+ queryDescReleaseCallback->arg = NULL;
+ MemoryContextRegisterResetCallback(CurrentMemoryContext,
+ queryDescReleaseCallback);
+
+ queryDescGlobal = queryDesc;
+ }
+}
+
+/*
+ * QueryDescReleaseFunc
+ * Memory context release callback function to clear the global pointer.
+ */
+static void
+QueryDescReleaseFunc(void *) {
+ queryDescGlobal = NULL;
+}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4c5a7bbf62..83e3b5c6ef 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -67,6 +67,7 @@
#include "utils/rls.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
+#include "commands/explain.h"
/* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
@@ -266,6 +267,11 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
InitPlan(queryDesc, eflags);
MemoryContextSwitchTo(oldcontext);
+
+ /*
+ * Track query descriptor for in-flight explains
+ */
+ ExplainTrackQuery(queryDesc);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c383f34c06..7000e4d8bc 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -25,6 +25,8 @@ static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+static void InstrEndLoopInternal(Instrumentation *instr, bool force);
+
/* Allocate new instrumentation structure(s) */
Instrumentation *
@@ -137,7 +139,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoopInternal(Instrumentation *instr, bool force)
{
double totaltime;
@@ -145,7 +147,7 @@ InstrEndLoop(Instrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->starttime) && !force)
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
@@ -164,6 +166,20 @@ InstrEndLoop(Instrumentation *instr)
instr->tuplecount = 0;
}
+/* Safely finish a run cycle for a plan node */
+void
+InstrEndLoop(Instrumentation *instr)
+{
+ InstrEndLoopInternal(instr, false);
+}
+
+/* Forcibly finish a run cycle for a plan node */
+void
+InstrEndLoopForce(Instrumentation *instr)
+{
+ InstrEndLoopInternal(instr, true);
+}
+
/* aggregate instrumentation information */
void
InstrAggNode(Instrumentation *dst, Instrumentation *add)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b7427906de..9183415f16 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -658,6 +658,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT))
HandleLogMemoryContextInterrupt();
+ if (CheckProcSignal(PROCSIG_LOG_EXPLAIN_PLAN))
+ HandleLogExplainPlanInterrupt();
+
if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
HandleParallelApplyMessageInterrupt();
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7298a187d1..7e1bad8517 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3457,6 +3457,9 @@ ProcessInterrupts(void)
if (LogMemoryContextPending)
ProcessLogMemoryContextInterrupt();
+ if (LogExplainPlanPending)
+ ProcessLogExplainPlanInterrupt();
+
if (ParallelApplyMessagePending)
HandleParallelApplyMessages();
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 60bc1217fb..25fec466d9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -36,6 +36,7 @@ volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
volatile sig_atomic_t IdleSessionTimeoutPending = false;
volatile sig_atomic_t ProcSignalBarrierPending = false;
volatile sig_atomic_t LogMemoryContextPending = false;
+volatile sig_atomic_t LogExplainPlanPending = false;
volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
volatile uint32 InterruptHoldoffCount = 0;
volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..c2c1a70f86 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,10 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+# logging explain plan details of the specified backend
+{ oid => '8769', descr => 'log explain plan of the specified backend',
+ proname => 'pg_log_backend_explain_plan', provolatile => 'v',
+ prorettype => 'bool', proargtypes => 'int4',
+ prosrc => 'pg_log_backend_explain_plan' },
+
]
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index f9525fb572..6098195cc4 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -16,6 +16,7 @@
#include "executor/executor.h"
#include "lib/stringinfo.h"
#include "parser/parse_node.h"
+#include "miscadmin.h"
typedef enum ExplainFormat
{
@@ -58,6 +59,7 @@ typedef struct ExplainState
List *deparse_cxt; /* context list for deparsing expressions */
Bitmapset *printed_subplans; /* ids of SubPlans we've printed */
bool hide_workers; /* set if we find an invisible Gather */
+ bool in_flight; /* set if tracking an active statement */
/* state related to the current plan node */
ExplainWorkersState *workers_state; /* needed if parallel plan */
} ExplainState;
@@ -126,4 +128,8 @@ extern void ExplainOpenGroup(const char *objtype, const char *labelname,
extern void ExplainCloseGroup(const char *objtype, const char *labelname,
bool labeled, ExplainState *es);
+extern void HandleLogExplainPlanInterrupt(void);
+extern void ProcessLogExplainPlanInterrupt(void);
+extern void ExplainTrackQuery(QueryDesc *queryDesc);
+
#endif /* EXPLAIN_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d5d69941c5..b7c98dfd78 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -108,6 +108,7 @@ extern void InstrStartNode(Instrumentation *instr);
extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrEndLoopForce(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..a8f581325d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,6 +94,7 @@ extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
extern PGDLLIMPORT volatile sig_atomic_t LogMemoryContextPending;
+extern PGDLLIMPORT volatile sig_atomic_t LogExplainPlanPending;
extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t CheckClientConnectionPending;
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 548519117a..7c86406a38 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -35,6 +35,7 @@ typedef enum
PROCSIG_WALSND_INIT_STOPPING, /* ask walsenders to prepare for shutdown */
PROCSIG_BARRIER, /* global barrier interrupt */
PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
+ PROCSIG_LOG_EXPLAIN_PLAN, /* ask backend to log the explain plan */
PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
/* Recovery conflict reasons */
--
2.32.0 (Apple Git-132)
Have you seen the other recent patch regarding this? [1]https://commitfest.postgresql.org/45/4345/ The mailing
list thread was active pretty recently. The submission is marked as
Needs Review. I haven't looked at either patch, but the proposals are
very similar as I understand it.
Hello Maciek,
Thanks for pointing that out. They are indeed super similar. Before I wrote
the patch I searched for
"explain" related ones. I guess I should have performed a better search.
Comparing the patches, there is one main difference: the existing patch
prints only the plan without
any instrumentation details of the current execution state at that
particular time. That is precisely what
I am looking for with this new feature.
I guess the way to proceed here would be to use the already existing patch
as a lot of work was done
there already.
Hi Rafael,
On Sun, Dec 3, 2023 at 2:27 AM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
Hello Maciek,
Thanks for pointing that out. They are indeed super similar. Before I wrote the patch I searched for
"explain" related ones. I guess I should have performed a better search.Comparing the patches, there is one main difference: the existing patch prints only the plan without
any instrumentation details of the current execution state at that particular time. That is precisely what
I am looking for with this new feature.I guess the way to proceed here would be to use the already existing patch as a lot of work was done
there already.
It might help if you add an incremental patch for reporting
instrumentation details on top of already existing patches on that
thread.
--
Best Wishes,
Ashutosh Bapat