Proposal: Progressive explain
Hello community,
CONTEXT:
Back in October I presented the talk "Debugging active queries with
mid-flight instrumented explain plans" at PGConf EU 2024
(recording: https://www.youtube.com/watch?v=6ahTb-7C05c) presenting
an experimental feature that enables visualization of in-progress
EXPLAIN ANALYZE executions. Given the positive feedback and requests,
I am sending this patch with the feature, which I am calling Progressive
Explain.
PROPOSAL:
This proposal introduces a feature to print execution plans of active
queries in an in-memory shared hash object so that other sessions can
visualize them with a new view: pg_stat_progress_explain.
Plans are only printed if the new GUC parameter progressive_explain is
enabled.
For regular queries or queries started with EXPLAIN (without ANALYZE)
the plan is printed only once at the start.
For instrumented runs (started via EXPLAIN ANALYZE or when auto_explain
flag log_analyze is enabled), the plan is printed on a fixed interval
controlled by the new GUC parameter progressive_explain_interval. This plan
includes all instrumentation stats computed so far (per node rows and
execution time).
New view:
- pg_stat_progress_explain
- pid: PID of the process running the query
- last_explain: timestamp when the plan was last printed
- explain_count: number of times the plan was printed
- total_explain_time: accumulated time spent printing plans (in ms)
- explain: the actual plan (limited read privileges)
New GUCs:
- progressive_explain: if progressive plans are printed for local
session.
- type: bool
- default: off
- context: user
- progressive_explain_interval: interval between each explain print.
- type: int
- default: 1s
- min: 10ms
- context: user
- progressive_explain_sample_rate: fraction of processed rows at
which progressive_explain_interval is checked to decide whether to
print a progressive plan
- type: floating point
- default: 0.1
- range: (0.0 - 1.0)
- context: user
- progressive_explain_output_size: max output size of the plan
printed in the in-memory hash table.
- type: int
- default: 4096
- min: 100
- context: postmaster
- progressive_explain_format: format used to print the plans.
- type: enum
- default: text
- context: user
- progressive_explain_settings: controls whether information about
modified configuration is added to the printed plan.
- type: bool
- default: off
- context: user
- progressive_explain_verbose: controls whether verbose details are
added to the printed plan.
- type: bool
- default: off
- context: user
DEMONSTRATION:
postgres=# SET progressive_explain = ON;
SET
postgres=# EXPLAIN ANALYZE SELECT *
FROM test t1
UNION ALL
SELECT *
FROM test t1;
postgres=# select * from pg_stat_progress_explain;
-[ RECORD 1 ]------+---------------------------------------------------------------
pid                | 299663
last_explain       | 2024-12-29 22:40:33.016833+00
explain_count      | 5
total_explain_time | 0.205
explain            | Append  (cost=0.00..466670.40 rows=20000160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1)
                   |   Buffers: shared hit=4288 read=112501
                   |   ->  Seq Scan on test t1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1)
                   |         Buffers: shared hit=4288 read=79046
                   |   ->  Seq Scan on test t1_1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current)
                   |         Buffers: shared read=33455
                   |
IMPLEMENTATION HIGHLIGHTS:
- The initial plan is printed at the end of standard_ExecutorStart
if progressive_explain is enabled, for both regular queries and
instrumented ones (EXPLAIN ANALYZE):
/*
 * Start progressive explain if enabled.
 */
if (progressive_explain)
    ProgressiveExplainBegin(queryDesc);
- The incremental plan print for instrumented runs uses a dedicated
ExecProcNode if progressive_explain is enabled:
if (node->instrument)
    if (progressive_explain)
        node->ExecProcNode = ExecProcNodeInstrExplain;
    else
        node->ExecProcNode = ExecProcNodeInstr;
else
    node->ExecProcNode = node->ExecProcNodeReal;
- ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an
additional part to print plans based on a sampling logic:
/*
 * Update progressive explain based on sampling.
 */
if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate)
    ProgressiveExplainUpdate(node);
That logic was added because ExecProcNodeInstrExplain is called once per
row processed (a lot of times) and performing the timestamp interval
check with progressive_explain_interval to decide whether to print
the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks
(shared at the end of this email) show that sampling + timestamp check
gives much better results than performing the timestamp check at every
ExecProcNodeInstrExplain call.
- The plans are stored in a shared hash object (explainArray) allocated
at database start, similar to procArray. ExplainHashShmemSize() computes
shared memory needed for it, based on max_connections + max_parallel_workers
for the amount of elements in the array and progressive_explain_output_size
for the size per element.
- A memory context release callback is configured in the memory context
where the query is running, being responsible for updating explainArray
even when the query doesn't finish gracefully.
- Instrumented plans printed incrementally need to clone instrumentation
objects before changing them, so each print uses a dedicated memory
context that is released after the output is constructed. This keeps
private memory usage from growing:
/* Dedicated memory context for the current plan being printed */
tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
                               "Progressive Explain Temporary Context",
                               ALLOCSET_DEFAULT_SIZES);
- A new version of InstrEndLoop (InstrEndLoopForce) was created that
can be called on in-progress instrumentation objects. Those are common
when traversing the plan tree of an active query.
- The explain column of pg_stat_progress_explain can only be read by
superusers or by the role that is running the query. If neither of those
conditions is met, users will see "<insufficient privilege>".
- For instrumented runs, the printed plan includes two per-node
modifiers when appropriate:
<current>: the plan node currently being processed.
<never executed>: a plan node not processed yet.
IMPLEMENTATION OVERHEAD:
When not used, the overhead added is:
- One IF at standard_ExecutorStart to check if progressive_explain is
enabled
- For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst
to define ExecProcNode wrapper
BENCHMARKS:
Three benchmark scenarios were performed:
A) Comparison between unpatched PG18, the patch with progressive explain
disabled, and the patch with the feature enabled globally (all queries
printing the plan at query start):
- PG18 without patch:
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2173978
latency average = 1.655 ms
tps = 18127.977529 (without initial connection time)
- PG18 with patch:
-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2198806
latency average = 1.636 ms
tps = 18333.157809 (without initial connection time)
-- progressive_explain = on (prints plan only once per query)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2047459
latency average = 1.756 ms
tps = 17081.477199 (without initial connection time)
B) EXPLAIN ANALYZE performance with different progressive_explain_interval
settings in the patched build:
-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f
script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms
-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f
script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms
-- progressive_explain = on
-- progressive_explain_interval = 10ms
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f
script.sql
number of transactions actually processed: 26
latency average = 4785.608 ms
C) EXPLAIN ANALYZE performance in the patched build with and without
progressive_explain_sample_rate, i.e., sampling with 2 different values
and also with no sampling logic:
-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f
script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms
-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f
script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms
-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 1
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f
script.sql
number of transactions actually processed: 19
latency average = 6432.902 ms
-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- NO SAMPLING LOGIC
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f
script.sql
number of transactions actually processed: 21
latency average = 5864.820 ms
BENCHMARK RESULTS:
It definitely needs more testing but preliminary results show that:
- From (A) we see that the patch adds negligible overhead when the
feature is not used. Enabling globally reduces overall TPS as all
queries are spending time printing the plan. The idea is to enable
progressive_explain on a per-need basis, only to a subset of sessions
that need it.
- From (B) we see that using progressive explain slightly increases
total execution time. Difference between using progressive_explain_interval
set to 1s (plan printed 4 times per query in the test) and to
10ms (plan printed ~460 times per query in the test) is very small.
The actual overhead appears when changing progressive_explain_sample_rate.
- From (C) we see that progressive_explain_sample_rate with a low
value (default 0.01) performs better than not using sampling or
using progressive_explain_sample_rate = 1. So the overhead of having
the sampling logic is much lower than not sampling at all.
TESTS:
Currently working on tests for a second version of the patch.
DOCUMENTATION:
Added documentation for the new view pg_stat_progress_explain,
new GUCs and a new item in section 14.1:
14.1. Using EXPLAIN
14.1.1. EXPLAIN Basics
14.1.2. EXPLAIN ANALYZE
14.1.3. Caveats
14.1.4. Progressive EXPLAIN
FURTHER DISCUSSION:
Considering that this patch introduces a new major feature with
several new components (view, GUCs, etc), there is open room for
discussion such as:
- Do the columns in pg_stat_progress_explain make sense? Are we
missing or adding unnecessary columns?
- Do the new GUCs make sense and are their default values appropriate?
- Do we want progressive explain to print plans of regular queries
started without EXPLAIN if progressive_explain is enabled or should
the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?
- Is the size of explainHash based on max_connections + max_parallel_workers
large enough or are there other types of backends that use the
executor and will print plans too?
Regards,
Rafael Castro.
Attachments:
v1-0001-Proposal-for-progressive-explains.patch
On Sun, Dec 29, 2024 at 8:19 PM Rafael Thofehrn Castro <rafaelthca@gmail.com>
wrote:
Plans are only printed if the new GUC parameter progressive_explain is
enabled.
Maybe track_explain instead? In the spirit of track_activity.
- progressive_explain_output_size: max output size of the plan printed in
the in-memory hash table.
- default: 4096
- min: 100
4096 seems low, if this means the explain plan is truncated at that size.
Also, the 100 minimum seems arbitrary.
So we can enable verbose and settings - but not wal? I could see that one
being useful. Not so much the rest (timing, summary). And buffers has
recently changed, so no need to worry about that. :)
- The plans are stored in a shared hash object (explainArray) allocated at
database start, similar to procArray. ExplainHashShmemSize() computes
shared memory needed for it, based on max_connections +
max_parallel_workers for the amount of elements in the array and
progressive_explain_output_size for the size per element.
Hmmm...don't have a solution/suggestion offhand, but using max_connections
would seem to be allocating a chunk of memory that is never used 99% of the
time, as most people don't run active queries near max_connections.
(Actually, on re-reading my draft, I would prefer a rotating pool like
pg_stat_statements does.)
- Column explain from pg_stat_progress_explain can only be visualized by
superusers or the same role that is running the query. If none of those
conditions are met, users will see "<insufficient privilege>".
Or pg_read_all_stats I presume? Are those other columns (e.g.
explain_count) being visible to anyone really useful, or can we throw them
all behind the same permission restriction?
- From (B) we see that using progressive explain slightly increases total
execution time.
Is this using the default dirt-simple pgbench queries? What about queries
that generate very large explain plans?
- Do the columns in pg_stat_progress_explain make sense? Are we missing or
adding unnecessary columns?
Perhaps having the interval and sample rate in here as well, since they are
user-level and thus could be different from other rows in the view. It is
tempting to throw in other things as well like the query_start and datname,
but we don't want to copy all of pg_stat_activity...
It's not clear if total_explain_time is now() - query_start or something
else. If not, I would love to see an elapsed time interval column.
Perhaps add a leader_pid column. That's something I would always be joining
with pg_stat_activity to find out.
- Do we want progressive explain to print plans of regular queries started
without EXPLAIN if progressive_explain is enabled or should
the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?
The latter, but no strong opinion.
id="guc-progressive-explain"
The new docs should mention the view name here, IMO, in addition to the
existing link that has details.
Random idea: track_explain_min_duration
Looks very cool overall, +1.
Cheers,
Greg
This proposal introduces a feature to print execution plans of active
queries in an in-memory shared hash object so that other sessions can
visualize them with a new view: pg_stat_progress_explain.
Thanks for this thread and for sharing the presentation
material. +1 for the idea of adding instrumentation that
will help users understand the bottlenecks in execution
plans. I want to share my perspective on this topic.
A DBA will want to know:
1/ Where is the bottleneck for a long running query currently
in flight?
2/ For an OLTP workload with many quick plans that
could be further optimized: what plan, and what
part of the plan, is contributing to the database load?
Having a view like pg_stat_progress_explain (maybe a more
appropriate name is pg_stat_progress_plan) will be
extremely useful, allowing a user to build monitoring
dashboards that can answer such questions.
I do not think however this instrumentation should only be
made available if a user runs EXPLAIN ANALYZE.
In my opinion, this will severely limit the usefulness of this
instrumentation in production. Of course, one can use auto_explain,
but users will be hesitant to enable auto_explain with analyze in
production for all their workloads. Also, there should not be an
auto_explain dependency for this feature.
One approach would be for the view to expose the
explain plan and the current node being executed. I think the
plan_node_id can be exposed for this purpose, but I have not looked
into this in much detail yet. The plan_node_id can then be used
to locate the part of the plan that is a potential bottleneck (if that
plan node is the one constantly being called).
This may also be work that is better suited for an extension, but
core will need to add a hook in ExecProcNode so an extension can
have access to PlanState.
Regards,
Sami Imseih
Amazon Web Services (AWS)
hi.
[48/208] Compiling C object
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
FAILED: contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
/usr/local/gcc-14.1.0/bin/gcc-14.1.0
-Icontrib/postgres_fdw/postgres_fdw.so.p -Isrc/include
-I../../Desktop/pg_src/src5/postgres/src/include
-Isrc/interfaces/libpq
-I../../Desktop/pg_src/src5/postgres/src/interfaces/libpq
-I/usr/include/libxml2 -fdiagnostics-color=always --coverage
-D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE
-Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wshadow=compatible-local -Wformat-security
-Wdeclaration-after-statement -Wmissing-variable-declarations
-Wno-format-truncation -Wno-stringop-truncation -Wunused-variable
-Wuninitialized -Werror=maybe-uninitialized -Wreturn-type
-DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES
-DREALLOCATE_BITMAPSETS -DLOCK_DEBUG -DRELCACHE_FORCE_RELEASE
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
-DRAW_EXPRESSION_COVERAGE_TEST -fno-omit-frame-pointer -fPIC -pthread
-fvisibility=hidden -MD -MQ
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o -MF
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o.d -o
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o -c
../../Desktop/pg_src/src5/postgres/contrib/postgres_fdw/postgres_fdw.c
In file included from
../../Desktop/pg_src/src5/postgres/contrib/postgres_fdw/postgres_fdw.c:22:
../../Desktop/pg_src/src5/postgres/src/include/commands/explain.h:86:9:
error: unknown type name ‘TimestampTz’
86 | TimestampTz last_explain;
| ^~~~~~~~~~~
[58/188] Linking target contrib/sslinfo/sslinfo.so
ninja: build stopped: subcommand failed.
compile failed. the above is the error message.
Thanks Greg, Sami and Jian for the feedback so far.
Maybe track_explain instead? In the spirit of track_activity.
That was the original name, and all other GUCs were following the
track_activity_* logic. Changed to the name of the feature after
discussion with colleagues at EDB. This is definitely open for
discussion.
4096 seems low, if this means the explain plan is truncated at
that size. Also, the 100 minimum seems arbitrary.
Min (100) and max (1048576) are the same as the values for GUC
track_activity_query_size, which has a very similar purpose: controls
the size of pg_stat_activity.query column.
So we can enable verbose and settings - but not wal? I could see
that one being useful. Not so much the rest (timing, summary). And
buffers has recently changed, so no need to worry about that. :)
The logic I used for adding GUCs that control explain options is that
none of these settings should change QueryDesc->instrument_options,
which would change instrumentation options added to the actual
execution. GUCs available modify only the ExplainState object, which
affects only the output printed to pg_stat_progress_explain.
Hmmm...don't have a solution/suggestion offhand, but using max_connections
would seem to be allocating a chunk of memory that is never
used 99% of the time, as most people don't run active queries
near max_connections.
(Actually, on re-reading my draft, I would prefer a rotating pool
like pg_stat_statements does.)
Agreed. I thought about using similar logic to pg_stat_statements,
but the pool size there is very large by default (5000). The difference
is that pg_stat_statements keeps the data on disk, and I wanted to
avoid that, as instrumented runs can print new plans very often,
affecting performance.
Maybe one idea would be to include a new GUC (progressive_explain_max_size)
that controls how many rows explainArray can have. If the limit is reached,
a backend won't print anything in that iteration.
It's not clear if total_explain_time is now() - query_start or something
else. If not, I would love to see an elapsed time interval column.
total_explain_time is accumulated time spent only on printing
the plan. It does not include execution time.
Perhaps add a leader_pid column. That's something I would always
be joining with pg_stat_activity to find out.
For prints done by parallel workers? That information is available
in pg_stat_activity. The idea is to use the pid column and join with
pg_stat_activity to get all other relevant details.
I do not think however this instrumentation should only be
made available if a user runs EXPLAIN ANALYZE.
In my opinion, this will severely limit the usefulness of this
instrumentation in production. Of course, one can use auto_explain,
but users will be hesitant to enable auto_explain with analyze in
production for all their workloads. Also, there should not be an
auto_explain dependency for this feature.
One approach will be for the view to expose the
explain plan and the current node being executed. I think the
plan_node_id can be exposed for this purpose but have not looked
into this in much detail yet. The plan_node_id can then be used
to locate the part of the plan that is a potential bottleneck ( if that
plan node is the one constantly being called ).
You mean that we could include the current node being executed even
for non-instrumented runs? In that case it would print the plain
plan + current node? That is a valid point and shouldn't be
difficult to implement. The problem is that this would require
adding overhead to ExecProcNode() (non-instrumented), and that
can be a performance killer.
This may also be work that is better suited for an extension, but
core will need to add a hook in ExecProcNode so an extension can
have access to PlanState.
Are you talking about implementing your proposal (also printing
plan with current node for non instrumented runs) as an extension
or this whole patch as an extension?
If the whole patch, I thought about that. The thing is that the
proposal also changes ExplainNode() function, the core function
to print a plan. To implement it as an extension we would have to
duplicate 95% of that code.
I do think there is merit in having this feature as part of
core and using existing extensions (auto_explain for example) to
extend it, like adding your suggestion of a hook in
ExecProcNode().
compile failed. the above is the error message.
Thanks. It was indeed missing an include. It complained only for
a complete build (including contrib), so I failed to catch it.
Sending a second version with the fix.
Rafael.
On Tue, Dec 31, 2024 at 3:00 AM jian he <jian.universality@gmail.com> wrote:
Attachments:
v2-0001-Proposal-for-progressive-explains.patch
hi.
all the newly added GUCs
progressive_explain
progressive_explain_verbose
progressive_explain_settings
progressive_explain_interval
progressive_explain_output_size
progressive_explain_format
progressive_explain_sample_rate
also need to be added to postgresql.conf.sample?
in doc/src/sgml/monitoring.sgml, we also need to add the view
pg_stat_progress_explain
to the section
<table id="monitoring-stats-dynamic-views-table">
<title>Dynamic Statistics Views</title>
(Table 27.1. Dynamic Statistics Views)
pg_stat_progress_explain.explain will be truncated after 4096 bytes
(the default value of progressive_explain_output_size),
so if progressive_explain_format is json
and the plan is bigger (imagine two partitioned tables joined
together, each having many partitions),
the "explain" column text may not be valid json.
Should we be concerned about this?
I don't really understand the actual usage of
pg_stat_progress_explain.explain_count.
The usage of the other columns makes sense to me.
Can you share your idea of why we need this column?
select name, category from pg_settings
where category = 'Query Tuning / Planner Method Configuration';
you will see that under config_group QUERY_TUNING_METHOD
all the GUC names generally begin with "enable".
Setting config_group QUERY_TUNING_METHOD for all the GUCs
beginning with "progressive" may not be appropriate; they are
not related to query tuning.
#include "utils/backend_status.h"
#include "storage/procarray.h"
#include "executor/spi.h"
#include "utils/guc.h"
In src/backend/commands/explain.c
the headers should generally be sorted in ascending alphabetical order.
The same ordering applies to ipci.c, execMain.c and execProcnode.c.
else
/* Node in progress */
if (es->progressive && planstate ==
planstate->state->progressive_explain_current_node)
appendStringInfo(es->str,
" (actual rows=%.0f loops=%.0f) (current)",
rows, nloops);
else
appendStringInfo(es->str,
" (actual rows=%.0f loops=%.0f)",
rows, nloops);
the above part in src/backend/commands/explain.c ExplainNode, the
indentation looks wrong to me.
Sending rebased version to fix cfbot tests.
Attachments:
v3-0001-Proposal-for-progressive-explains.patch
Hi Rafael,
This sounds like a great feature, thanks for working on it and sharing
the patch. Let me share some comments / questions after a quick review.
On 12/30/24 02:18, Rafael Thofehrn Castro wrote:
Hello community,
CONTEXT:
Back in October I presented the talk "Debugging active queries with
mid-flight instrumented explain plans" at PGConf EU 2024
(recording: https://www.youtube.com/watch?v=6ahTb-7C05c) presenting
an experimental feature that enables visualization of in progress
EXPLAIN ANALYZE executions. Given the positive feedback and requests,
I am sending this patch with the feature, which I am calling Progressive
Explain.
Really nice talk, I enjoyed watching it a couple weeks back.
PROPOSAL:
This proposal introduces a feature to print execution plans of active
queries in an in-memory shared hash object so that other sessions can
visualize them with a new view: pg_stat_progress_explain.
Plans are only printed if the new GUC parameter progressive_explain is
enabled.
Aren't the names of the view / GUC a bit misleading? Because this is
really about EXPLAIN ANALYZE, not just EXPLAIN. Not sure.
For regular queries or queries started with EXPLAIN (without ANALYZE)
the plan is printed only once at the start.
Initially I thought this is a bit weird, and also inconsistent with the
auto_explain (which prints plans at the end). But I think for "progress"
that makes sense, even if it's not updated after that.
For instrumented runs (started via EXPLAIN ANALYZE or when auto_explain
flag log_analyze is enabled), the plan is printed on a fixed interval
controlled by the new GUC parameter progressive_explain_interval. This plan
includes all instrumentation stats computed so far (per node rows and
execution time).
OK, I understand why it works like this. But I rather dislike that it
relies on auto_explain to enable the progressive updates. If we want to
make this dependent on auto_explain, then most of this could/should be
moved to auto_explain, and called through a hook. Or something similar.
If we want to add this to core, then I think the progressive_explain GUC
should have "off|explain|analyze" values, or a similar way to enable the
instrumentation and regular updates. But that would probably also need
to duplicate a number of auto_explain options (e.g. to allow disabling
timings).
New view:
- pg_stat_progress_explain
- pid: PID of the process running the query
- last_explain: timestamp when the plan was last printed
- explain_count: number of times the plan was printed
- total_explain_time: accumulated time spent printing plans (in ms)
- explain: the actual plan (limited read privileges)
I find the "explain_time" a bit misleading. On the one hand - yes, it
does account for time spent generating the EXPLAIN output. But on the
other hand, that's only a part of the overhead - it ignores the overhead
of the extra instrumentation. Also, wouldn't it be better to rename
"explain" to "query plan"? That's what the EXPLAIN result says too.
New GUCs:
- progressive_explain: if progressive plans are printed for local
session.
- type: bool
- default: off
- context: user
- progressive_explain_interval: interval between each explain print.
- type: int
- default: 1s
- min: 10ms
- context: user
Seems reasonable (but I already commented on progressive_explain).
- progressive_explain_sample_rate: fraction of rows processed by the
query until progressive_explain_interval is evaluated to print a
progressive plan
- type: floating point
- default: 0.1
- range: (0.0 - 1.0)
- context: user
I find this pretty weird / unnecessary. More comments later.
- progressive_explain_output_size: max output size of the plan
printed in the in-memory hash table.
- type: int
- default: 4096
- min: 100
- context: postmaster
Seems far too small.
- progressive_explain_format: format used to print the plans.
- type: enum
- default: text
- context: user
Good idea. When processing the view automatically, JSON would be much
better than text.
- progressive_explain_settings: controls whether information about
modified configuration is added to the printed plan.
- type: bool
- default: off
- context: user
- progressive_explain_verbose: controls whether verbose details are
added to the printed plan.
- type: bool
- default: off
- context: user
This seems to be the duplication of some auto_explain parameters that I
mentioned above. Except that it only duplicates some of them. I'd like
the ability to disable collecting timing info, which is usually by far
the main overhead.
DEMONSTRATION:
postgres=# SET progressive_explain = ON;
SET
postgres=# EXPLAIN ANALYZE SELECT *
FROM test t1
UNION ALL
SELECT *
FROM test t1;
postgres=# select * from pg_stat_progress_explain;
-[ RECORD 1 ]------+---------------------------------------------------------------
pid                | 299663
last_explain       | 2024-12-29 22:40:33.016833+00
explain_count      | 5
total_explain_time | 0.205
explain            | Append  (cost=0.00..466670.40 rows=20000160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1)
                   |   Buffers: shared hit=4288 read=112501
                   |   ->  Seq Scan on test t1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1)
                   |         Buffers: shared hit=4288 read=79046
                   |   ->  Seq Scan on test t1_1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current)
                   |         Buffers: shared read=33455
                   |
IMPLEMENTATION HIGHLIGHTS:
- The initial plan is printed at the end of standard_ExecutorStart
if progressive_explain is enabled, for both regular queries and
instrumented ones (EXPLAIN ANALYZE):

    /*
     * Start progressive explain if enabled.
     */
    if (progressive_explain)
        ProgressiveExplainBegin(queryDesc);

- The incremental plan print for instrumented runs uses a dedicated
ExecProcNode if progressive_explain is enabled:

    if (node->instrument)
    {
        if (progressive_explain)
            node->ExecProcNode = ExecProcNodeInstrExplain;
        else
            node->ExecProcNode = ExecProcNodeInstr;
    }
    else
        node->ExecProcNode = node->ExecProcNodeReal;

- ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an
additional part that prints plans based on a sampling logic:
I haven't looked at the details yet, but this seems mostly reasonable.
    /*
     * Update progressive explain based on sampling.
     */
    if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate)
        ProgressiveExplainUpdate(node);

That logic was added because ExecProcNodeInstrExplain is called once per
row processed (a lot of times) and performing the timestamp interval
check with progressive_explain_interval to decide whether to print
the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks
(shared at the end of this email) show that sampling + timestamp check
gives much better results than performing the timestamp check at every
ExecProcNodeInstrExplain call.
I don't think this is the way to deal with the high cost of collecting
timing information. It just adds unpredictability, because how would you
know what's a good sample rate? What if you pick a far too low value,
and then a plan gets logged much less frequently because for that
particular query it takes much longer to return a tuple?
I think the right solution to deal with high cost of timing are
timeouts. Set a timeout, with a handler that sets a "pending" flag, and
check the flag before calling ProgressiveExplainUpdate(). See for
example TransactionTimeoutHandler in postinit.c.
- The plans are stored in a shared hash object (explainArray) allocated
at database start, similar to procArray. ExplainHashShmemSize() computes
shared memory needed for it, based on max_connections + max_parallel_workers
for the amount of elements in the array and progressive_explain_output_size
for the size per element.
Why do we need a shared hash table?
- A memory context release callback is configured in the memory context
where the query runs; it is responsible for updating explainArray
even when the query doesn't finish gracefully.
OK
- Instrumented plans printed incrementally need to clone instrumentation
objects before changing them, so each print uses a dedicated memory
context that gets released after the output is constructed. This avoids
growing private memory usage:
Yeah. I was wondering how you're going to deal with this. Wouldn't it be
cheaper to just use a static variable? I don't think this is called
recursively, and that'd save the palloc/pfree. Haven't tried and not
sure if it's worth it.
/* Dedicated memory context for the current plan being printed */
tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
"Progressive Explain Temporary Context",
ALLOCSET_DEFAULT_SIZES);
Maybe it'd be better to keep this memory context for the query duration
(e.g. by adding it to queryDesc), and just reset it before the calls?
That'd cache the memory, and it shouldn't really use a lot of it, no?
Also, ProgressiveExplainPrint() does this at the end:
+ /*
+ * Free local explain state before exiting as this function may be
+ * called multiple times in the same memory context.
+ */
+ pfree(es->str);
+ pfree(es);
Isn't this actually pointless with the local memory context?
- A new version of InstrEndLoop (InstrEndLoopForce) was created that allows
to be called targeting in-progress instrumented objects. Those are common
when traversing the plan tree of an active query.
No opinion yet.
- Column explain from pg_stat_progress_explain can only be visualized by
superusers or the same role that is running the query. If none of those
conditions are met, users will see "<insufficient privilege>".
I think this is in line with how we restrict access to similar catalogs.
- For instrumented runs, the printed plan includes 2 per-node modifiers
when appropriate:
  <current>: the plan node currently being processed.
  <never executed>: a plan node not processed yet.
Not sure the <current> label is all that useful. It seems misleading at
best, because it's simply the last node that generated the explain. But
we might have already moved to a different node. And that node may be
stuck / very expensive, yet we'll see the plan as seemingly waiting in
some other node.
IMPLEMENTATION OVERHEAD:
When not used, the overhead added is:
- One IF at standard_ExecutorStart to check if progressive_explain is
enabled
- For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst
to define the ExecProcNode wrapper

BENCHMARKS:
Performed 3 benchmark scenarios:
A) Comparison between unpatched PG18, patched with progressive explain
disabled, and patched with the feature enabled globally (all queries
printing the plan at query start):

- PG18 without patch:
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2173978
latency average = 1.655 ms
tps = 18127.977529 (without initial connection time)

- PG18 with patch:
-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2198806
latency average = 1.636 ms
tps = 18333.157809 (without initial connection time)

-- progressive_explain = on (prints plan only once per query)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2047459
latency average = 1.756 ms
tps = 17081.477199 (without initial connection time)

B) EXPLAIN ANALYZE performance with different progressive_explain_interval
settings in patched:

-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms

-- progressive_explain = on
-- progressive_explain_interval = 10ms
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4785.608 ms

C) EXPLAIN ANALYZE performance in patched with and without
progressive_explain_sample_rate, i.e., sampling with 2 different values
and also no sampling logic:

-- progressive_explain = off
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 1
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 19
latency average = 6432.902 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- NO SAMPLING LOGIC
postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 21
latency average = 5864.820 ms

BENCHMARK RESULTS:
It definitely needs more testing but preliminary results show that:

- From (A) we see that the patch adds negligible overhead when the
feature is not used. Enabling it globally reduces overall TPS as all
queries spend time printing the plan. The idea is to enable
progressive_explain on a per-need basis, only in the subset of sessions
that need it.

- From (B) we see that using progressive explain slightly increases
total execution time. The difference between progressive_explain_interval
set to 1s (plan printed 4 times per query in the test) and to
10ms (plan printed ~460 times per query in the test) is very small.
The actual overhead appears when changing progressive_explain_sample_rate.

- From (C) we see that progressive_explain_sample_rate with a low
value (default 0.01) performs better than not using sampling or
using progressive_explain_sample_rate = 1. So the overhead of having
the sampling logic is much lower than not sampling at all.
No opinion. I need to do some testing / benchmarking myself.
TESTS:
Currently working on tests for a second version of the patch.
DOCUMENTATION:
Added documentation for the new view pg_stat_progress_explain,
new GUCs and a new item in section 14.1:

14.1. Using EXPLAIN
  14.1.1. EXPLAIN Basics
  14.1.2. EXPLAIN ANALYZE
  14.1.3. Caveats
  14.1.4. Progressive EXPLAIN

FURTHER DISCUSSION:
Considering that this patch introduces a new major feature with
several new components (view, GUCs, etc), there is room for
discussion, such as:

- Do the columns in pg_stat_progress_explain make sense? Are we
missing or adding unnecessary columns?

- Do the new GUCs make sense and are their default values appropriate?

- Do we want progressive explain to print plans of regular queries
started without EXPLAIN if progressive_explain is enabled, or should
the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?

- Is the size of explainHash based on max_connections + max_parallel_workers
large enough, or are there other types of backends that use the
executor and will print plans too?
I've commented on some of these items earlier.
regards
--
Tomas Vondra
On 12/30/24 2:18 AM, Rafael Thofehrn Castro wrote:
Hello community,
CONTEXT:
Back in October I presented the talk "Debugging active queries with
mid-flight instrumented explain plans" at PGConf EU 2024
(recording: https://www.youtube.com/watch?v=6ahTb-7C05c) presenting
an experimental feature that enables visualization of in progress
EXPLAIN ANALYZE executions. Given the positive feedback and requests,
I am sending this patch with the feature, which I am calling Progressive
Explain.

PROPOSAL:
This proposal introduces a feature to print execution plans of active
queries in an in-memory shared hash object so that other sessions can
visualize them with a new view: pg_stat_progress_explain.
Hello
Thanks for your contribution! Such a feature would be very useful.
I did not look carefully about the implementation. I just wanted to let
you know someone already posted a similar feature in this thread :
/messages/by-id/CADdR5ny_0dFwnD+suBnV1Vz6NDKbFHeWoV1EDv9buhDCtc3aAA@mail.gmail.com
Maybe it could give you some ideas.
Regards,
--
Adrien NAYRAT
Hi,
Thanks for the valuable feedback Tomas. I am sending a new version of the
patch
that includes:
- Changed instrumented plan printing based on timeouts instead of sampling.
This works perfectly and benchmarks are promising. So new GUC
progressive_explain_sample_rate is removed.
- Removed all parts of the code where allocated memory for instrumentation
and explain objects is released as this is done automatically when the
custom memory context is released. Comments added later on.
- Adjusted regression tests expected objects so tests pass.
OK, I understand why it works like this. But I rather dislike that it
relies on auto_explain to enable the progressive updates. If we want to
make this dependent on auto_explain, then most of this could/should be
moved to auto_explain, and called through a hook. Or something similar.
If we want to add this to core, then I think the progressive_explain GUC
should have "off|explain|analyze" values, or a similar way to enable the
instrumentation and regular updates. But that would probably also need
to duplicate a number of auto_explain options (e.g. to allow disabling
timings).
This seems to be the duplication of some auto_explain parameters that I
mentioned above. Except that it only duplicates some of them. I'd like
the ability to disable collecting timing info, which is usually by far
the main overhead.
I implemented the current logic with the premise that this new feature
shouldn't change how queries are executed. If a query is not run with
EXPLAIN ANALYZE, it shouldn't transparently enable instrumentation,
which would cause additional overhead. If someone wants an instrumented
run, they explicitly call EXPLAIN ANALYZE.
That is the same reasoning for coming up with the GUCs. The only ones
available are progressive_explain_format, progressive_explain_settings
and progressive_explain_verbose, which will only affect the explain
output without changing how the query is executed.
But if we come to an agreement that "off|explain|analyze" values for
the progressive_explain GUC are a better approach, changing the logic
should be easy. Now that you mentioned the idea, it does make sense.
I think the right solution to deal with high cost of timing are
timeouts. Set a timeout, with a handler that sets a "pending" flag, and
check the flag before calling ProgressiveExplainUpdate(). See for
example TransactionTimeoutHandler in postinit.c.
Thanks for that. The new patch sent in this message uses a timeout instead.
Why do we need a shared hash table?
This is where I am storing the printed plans so other backends can read
them. I saw that as the only option apart from printing to files. Do
you recommend something else?
Yeah. I was wondering how you're going to deal with this. Wouldn't it be
cheaper to just use a static variable? I don't think this is called
recursively, and that'd save the palloc/pfree. Haven't tried and not
sure if it's worth it.
You mean an instrumented object allocated only once that will be updated
with the counters of the currently processed node? That is also an option.
Maybe it is worth testing considering that it may save millions of
palloc/pfree.
Will give it a go.
Maybe it'd be better to keep this memory context for the query duration
(e.g. by adding it to queryDesc), and just reset it before the calls?
That'd cache the memory, and it shouldn't really use a lot of it, no?
Instead of having a dedicated memory context, use the same context as the
one running the query? This would then require manually freeing the
ExplainState allocated in each ProgressiveExplainPrint() call (I removed
in this patch the manual allocation that wasn't needed) and also the
allocated instrumentation objects (which will no longer be needed if we
use a static variable). This saves memory context creation/release and is
definitely a better option. Will test it.
Not sure the <current> label is all that useful. It seems misleading at
best, because it's simply the last node that generated the explain. But
we might have already moved to a different node. And that node may be
stuck / very expensive, yet we'll see the plan as seemingly waiting in
some other node.
Right. The idea here is that, for some specific cases, execution is
momentarily limited to a subpart of the plan. For example, if a Hash Join
is present, the whole set of nodes inside the hash creation needs to
complete before other operations are done. If we see a <current> as part
of that hash creation we know that the query is currently in that region,
regardless of which specific node it is.
This could become a feature implemented as part of an extension, with the
help of a new hook inside ExplainNode(), if we agree that it is not
really needed out of the box.
Thanks for your contribution! Such a feature would be very useful.
I did not look carefully about the implementation. I just wanted to let
you know someone already posted a similar feature in this thread :
Maybe it could give you some ideas.
Thanks Adrien! Will have a look.
Regards,
Rafael Castro.
Attachments:
v4-0001-Proposal-for-progressive-explains.patch
Hello all,
Sending a new version of the patch that includes important changes
addressing
feedback provided by Greg and Tomas. So, including the previous version (v5)
sent on Jan 29, these are the highlights of what has changed:
- Progressive plan printing on the regular interval defined by
progressive_explain_timeout now uses timeouts. GUC
progressive_explain_sample_rate is removed.
- Objects allocated per plan print (Instrument and ExplainState) were
replaced by reusable objects allocated at query start (during the
progressive explain setup phase). So currently we allocate only 2 objects
for the complete duration of the feature. With that, the temporary memory
context that was being allocated per iteration is removed.
- progressive_explain GUC was changed from boolean to enum, accepting
values 'off', 'explain' and 'analyze'. This allows using instrumented
progressive explains for any query and not only the ones started via
EXPLAIN ANALYZE. If the GUC is set to 'explain' the plan will be printed
only once at query start. If set to 'analyze', instrumentation will be
enabled in QueryDesc and the detailed plan will be printed iteratively.
Considering that we can now enable instrumentation for regular queries,
added the following GUCs to control which instruments are enabled:
progressive_explain_buffers, progressive_explain_timing and
progressive_explain_wals.
- Better handling of the shared memory space where plans are printed and
shared with other backends. In the previous version we had a shared hash
with elements holding all data related to progressive explains, including
the complete plan string:
typedef struct explainHashEntry
{
explainHashKey key; /* hash key of entry - MUST BE FIRST */
int pid;
TimestampTz last_explain;
int explain_count;
float explain_duration;
char plan[];
} explainHashEntry;
The allocated size per element used to be defined by
progressive_explain_output_size, which would essentially control the
space available for plan[].

Greg raised the concern of PG having to allocate too much shared memory
at database start considering that we need enough space for
max_connections + max_parallel_workers, and that is a totally valid point.
So the new version takes advantage of DSAs. Each backend creates its own
DSA at query start (if progressive explain is enabled) where the shared
data is stored. That DSA is shared with other backends via hash structure
through dsa_handle and dsa_pointer pointers:
typedef struct progressiveExplainHashEntry
{
progressiveExplainHashKey key; /* hash key of entry - MUST BE FIRST */
dsa_handle h;
dsa_pointer p;
} progressiveExplainHashEntry;
typedef struct progressiveExplainData
{
int pid;
TimestampTz last_print;
char plan[];
} progressiveExplainData;
That allows us to allocate areas of custom sizes for plan[]. The strategy
currently used is to allocate an initial space of the size of the
initial plan output + PROGRESSIVE_EXPLAIN_ALLOC_SIZE (currently 4096),
which gives PG enough room for subsequent iterations where the new string
may grow a bit, without having to reallocate space. The code checks sizes
and will reallocate if needed. With that, GUC progressive_explain_output_size
was removed.
- Adjusted columns of pg_stat_progress_explain. Columns explain_count and
total_explain_time were removed. Column last_explain was renamed to
last_print.
Column explain was renamed to query_plan, as this is the name used by PG
when a plan is printed with EXPLAIN.
Rafael.
Attachments:
v5-0001-Proposal-for-progressive-explains.patch
Fixed a corner case where pg_stat_progress_explain is looking at its own
plan.
Previous message in this thread contains all relevant implementation
details of
the last patch.
Attachments:
v6-0001-Proposal-for-progressive-explains.patch
Hello,
Was recently benchmarking the last version of the patch and found room for
improvement when GUC progressive_explain is enabled globally.
Results with the last published version of the patch:
- progressive_explain = off:
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18249.363540 (without initial connection time)
- progressive_explain = 'explain':
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 3536.635125 (without initial connection time)
This is because progressive explains are being printed for every query,
including
the ones that finish instantly.
If we think about it, those printed plans for instant queries are useless as
other backends won't have a chance to look at the plans before they get
removed
from pg_stat_progress_explain.
So this new version of the patch implements new GUC
progressive_explain_min_duration to be used as a threshold for the plan
to be printed for the first time:

- progressive_explain_min_duration: min query duration until progressive
explain starts.
- type: int
- default: 1s
- min: 0
- context: user
Results with the new version:
- progressive_explain = off:
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18871.800242 (without initial connection time)
- progressive_explain = 'explain' and progressive_explain_min_duration =
'5s':
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18896.266538 (without initial connection time)
Implementation of the new GUC progressive_explain_min_duration was done
with timeouts. The timeout callback function is used to initialize the
progressive explain.
There is a catch to this implementation. In thread
/messages/by-id/d68c3ae31672664876b22d2dcbb526d2@postgrespro.ru
where torikoshia proposes logging of query plans, concerns were raised
about logging plans in the CFI, a sensitive part of the code. So
torikoshia implemented a smart workaround consisting in adjusting the
execProcNode wrapper of all nodes so that the plan printing can be done
there.
I'm not sure if this same concern applies to timeout callbacks, so I also
implemented a second version of the latest patch that uses that
execProcNode wrapper strategy.
The wrapper code was implemented by torikoshia (torikoshia@oss.nttdata.com),
so
adding the credits here.
Rafael.
Attachments:
v7-0001-Proposal-for-progressive-explains.patch
v7-0001-Proposal-for-progressive-explains-with-execprocnode-wrapper.patch.nocfbot
Did additional benchmarks and found issues with the patch that doesn't do
execProcNode wrapping. There are sporadic crashes with "double free or
corruption (top)".
So making the patch that uses the wrapper the current one. Again, giving
the credits to torikoshia as being the owner of that section of the code.
Rafael.
Attachments:
v8-0001-Proposal-for-progressive-explains.patch
On Fri, Mar 7, 2025, at 5:34 PM, Rafael Thofehrn Castro wrote:
Did additional benchmarks and found issues with the patch that doesn't do execProcNode
wrapping. There are sporadic crashes with "double free or corruption (top)".
So making the patch that uses the wrapper the current one. Again, giving the credits to
torikoshia as being the owner of that section of the code.
Rafael, thanks for working on it. It is a step forward in observability. I
started with some performance tests and the last improvements seem to fix the
overhead imposed by the initial patch version. I didn't notice any of these new
functions in the perf report while executing fast queries.
I found a crash. It is simple to reproduce.
Session A:
select * from pg_stat_progress_explain;
\watch 2
Session B:
explain select pg_sleep(30);
\watch 2
8<--------------------------------------------------------------------8<
Backtrace:
Core was generated by `postgres: euler postgres [lo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
5401 if (ps->ExecProcNodeOriginal != NULL)
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
#1 0x00005624173829aa in handle_sig_alarm (postgres_signal_arg=<optimized out>) at timeout.c:414
#2 0x00005624173ba02c in wrapper_handler (postgres_signal_arg=14) at pqsignal.c:110
#3 <signal handler called>
#4 0x00007f20fa529e63 in epoll_wait (epfd=6, events=0x56244ef37e58, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#5 0x00005624171fb02f in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdd9e62080, cur_timeout=-1, set=0x56244ef37dd8) at waiteventset.c:1190
#6 WaitEventSetWait (set=0x56244ef37dd8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdd9e62080, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at waiteventset.c:1138
#7 0x000056241709513c in secure_read (port=0x56244eeb90e0, ptr=0x56241775a9a0 <PqRecvBuffer>, len=8192) at be-secure.c:218
#8 0x000056241709bf2e in pq_recvbuf () at pqcomm.c:924
#9 0x000056241709ceb5 in pq_getbyte () at pqcomm.c:970
#10 0x000056241721b617 in SocketBackend (inBuf=0x7ffdd9e622a0) at postgres.c:361
#11 ReadCommand (inBuf=0x7ffdd9e622a0) at postgres.c:484
#12 PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4625
#13 0x00005624172167ed in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000056241717519b in postmaster_child_launch (child_type=<optimized out>, child_slot=2, startup_data=startup_data@entry=0x7ffdd9e6253c, startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0x7ffdd9e62540) at launch_backend.c:274
#15 0x0000562417178c32 in BackendStartup (client_sock=0x7ffdd9e62540) at postmaster.c:3519
#16 ServerLoop () at postmaster.c:1688
#17 0x000056241717a6da in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x56244eeb81b0) at postmaster.c:1386
#18 0x0000562416e64f9a in main (argc=1, argv=0x56244eeb81b0) at main.c:230
8<--------------------------------------------------------------------8<
You call this feature "progressive explain". My first impression is that it
will only provide the execution plans for EXPLAIN commands. Instead of
"progressive explain", I would suggest "query progress", which is general
database terminology. It seems natural to use "progressive explain" since you
are using the explain infrastructure (including the same options -- format,
settings, wal, ...) to print the execution plan.
+CREATE VIEW pg_stat_progress_explain AS
+ SELECT
+ *
+ FROM pg_stat_progress_explain(true);
+
There is no use for the function argument. If you decide to keep this function,
remove it.
Why don't you use the pgstat_progress_XXX() API? Since you are using a
pg_stat_progress_XXX view name I would expect using the command progress
reporting infrastructure (see backend_progress.c).
Maybe you could include datid and datname as the other progress reporting
views. It would avoid a join to figure out what the database is.
+static const struct config_enum_entry explain_format_options[] = {
+ {"text", EXPLAIN_FORMAT_TEXT, false},
+ {"xml", EXPLAIN_FORMAT_XML, false},
+ {"json", EXPLAIN_FORMAT_JSON, false},
+ {"yaml", EXPLAIN_FORMAT_YAML, false},
+ {NULL, 0, false}
+};
Isn't it the same definition as in auto_explain.c? Use only one definition
for auto_explain and this feature. You can put this struct into explain.c,
use an extern definition for guc_tables.c, and put an extern PGDLLIMPORT
definition into guc.h. See wal_level_options for an example.
+static const struct config_enum_entry progressive_explain_options[] = {
+ {"off", PROGRESSIVE_EXPLAIN_NONE, false},
+ {"explain", PROGRESSIVE_EXPLAIN_EXPLAIN, false},
+ {"analyze", PROGRESSIVE_EXPLAIN_ANALYZE, false},
+ {"false", PROGRESSIVE_EXPLAIN_NONE, true},
+ {NULL, 0, false}
+};
The "analyze" is a separate option in auto_explain. Should we have 2
options? One that enables/disables this feature and another that
enables/disables the analyze option.
Don't the other EXPLAIN options make sense here? Like serialize and summary.
TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
PlanState *planstate; /* tree of per-plan-node state */
+ struct ExplainState *pe_es; /* progressive explain state if enabled */
Should you use the same name pattern here? pestate, for example.
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ExplainHash)
Could you use a specific name? Even if you keep the proposed name, you should
use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.
+$node->init;
+# Configure progressive explain to be logged immediatelly
+$node->append_conf('postgresql.conf', 'progressive_explain_min_duration = 0');
+$node->start;
s/immediatelly/immediately/
+typedef struct progressiveExplainHashKey
+{
+ int pid; /* PID */
+} progressiveExplainHashKey;
+
+typedef struct progressiveExplainHashEntry
+{
+ progressiveExplainHashKey key; /* hash key of entry - MUST BE FIRST */
+ dsa_handle h;
+ dsa_pointer p;
+} progressiveExplainHashEntry;
You don't need progressiveExplainHashKey. Use pid as key directly.
+ /* Update shared memory with new data */
+ strcpy(pe_data->plan, es->str->data);
+ pe_data->last_print = GetCurrentTimestamp();
I don't think last_print is accurate because it is not the time the execution plan
is printed but the time it was updated. I suggest last_updated_time.
+/* Flag set by timeouts to control when to print progressive explains */
+bool ProgressiveExplainPending = false;
s/print/update/
There are other points where you use "print"; it is better to replace them
with "update" as well.
+ progressiveExplainArray = ShmemInitHash("progressive explain hash",
+ 50, 50,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
better name.
+ entry->p = dsa_allocate(es->pe_a,
+ add_size(sizeof(progressiveExplainData),
+ add_size(strlen(es->str->data),
+ PROGRESSIVE_EXPLAIN_ALLOC_SIZE)));
I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
PROGRESSIVE_EXPLAIN_AVAIL_SIZE?
Maybe you can use dshash.
Some comments still refer to the wrong function name.
+/*
+ * ExecProcNodeWithExplain
+ * ExecProcNode wrapper that initializes progressive explain
+ * and uwraps ExecProdNode to the original function.
+ */
+static TupleTableSlot *
+ExecProcNodeExplain(PlanState *node)
and
+/*
+ * ExecProcNodeWithExplain
+ * Responsible for initialization of all structures related to progressive
+ * explains.
+ *
+ /* state related to progressive explains */
+ struct PlanState *pe_curr_node;
+ struct Instrumentation *pe_local_instr;
+ dsa_area *pe_a;
Could you add some comments saying what each of these variables are for?
I didn't experiment but I was wondering if there is a way to avoid the
duplicates that you added to avoid the overhead.
--
Euler Taveira
EDB https://www.enterprisedb.com/
Thanks for the valuable inputs Euler. Adjusted most of your recommendations.
I found a crash. It is simple to reproduce.
Indeed, I failed to test plain EXPLAIN after the addition of the new GUC
progressive_explain_min_duration. This is fixed.
You call this feature "progressive explain". My first impression is that it
will only provide the execution plans for EXPLAIN commands. Instead of
"progressive explain", I would suggest "query progress", which is general
database terminology. It seems natural to use "progressive explain" since
you are using the explain infrastructure (including the same options --
format, settings, wal, ...) to print the execution plan.
Makes sense. Kept progressive explain for now but this is still open for
discussion.
There is no use for the function argument. If you decide to keep this
function, remove it.
Done.
Why don't you use the pgstat_progress_XXX() API? Since you are using a
pg_stat_progress_XXX view name I would expect using the command progress
reporting infrastructure (see backend_progress.c).
I haven't changed that part as of now. My implementation and underlying data
structure may not work well with that API, but I am investigating.
Maybe you could include datid and datname as the other progress reporting
views. It would avoid a join to figure out what the database is.
Done.
Isn't it the same definition as in auto_explain.c? Use only one definition
for auto_explain and this feature. You can put this struct into explain.c,
use an extern definition for guc_tables.c, and put an extern PGDLLIMPORT
definition into guc.h. See wal_level_options for an example.
Done.
The "analyze" is a separate option in auto_explain. Should we have 2
options? One that enables/disables this feature and another that
enables/disables the analyze option.
Tomas Vondra proposed the current logic and I think it makes sense. Having
a single GUC to control the execution behavior keeps the feature simpler IMO.
Don't the other EXPLAIN options make sense here? Like serialize and summary.
I added a missing GUC for option COSTS (progressive_explain_costs). Adding
the other ones doesn't make much sense IMO. SUMMARY, SERIALIZE and MEMORY
are information added at the end of the query execution (or plan creation,
for plain EXPLAIN) in the summary section, but at that point the
progressive explain will already be finished, with no more information in
pg_stat_progress_explain.
TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
PlanState *planstate; /* tree of per-plan-node state */
+ struct ExplainState *pe_es; /* progressive explain state if enabled */
Should you use the same name pattern here? pestate, for example.
Done.
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ExplainHash)
Could you use a specific name? Even if you keep the proposed name, you
should use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.
Changed to ProgressiveExplainHash.
You don't need progressiveExplainHashKey. Use pid as key directly.
Done.
I don't think last_print is accurate because it is not the time the
execution plan is printed but the time it was updated. I suggest
last_updated_time.
Changed from last_print to last_update. This is still open for discussion.
I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
better name.
I used it to be consistent with ProcArray (that is also a hash). But what
you proposed is indeed better. Changed.
I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because
it doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
PROGRESSIVE_EXPLAIN_AVAIL_SIZE?
Changed to PROGRESSIVE_EXPLAIN_FREE_SIZE.
Fixed the wrong function names in the comments and changed the format of
those comments in function headers to be compatible with other functions
in explain.c.
+ /* state related to progressive explains */
+ struct PlanState *pe_curr_node;
+ struct Instrumentation *pe_local_instr;
+ dsa_area *pe_a;
Could you add some comments saying what each of these variables are for?
Done.
I didn't experiment but I was wondering if there is a way to avoid the
duplicates that you added to avoid the overhead.
You mean the local instrumentation object reused for each node?
Rafael.
Attachment: v9-0001-Proposal-for-progressive-explains.patch
Sending a new version as rebase was required.
Rafael.
Attachment: v10-0001-Proposal-for-progressive-explains.patch
On Wed, Mar 19, 2025 at 1:47 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
Sending a new version as rebase was required.
Reading this thread, it seems to me that there has been a good deal of
discussion about things like the name of the feature and what the UI
ought to be, but not much discussion of whether the feature is
actually safe, and not much detailed review of the code. I'm willing
to bet that everybody wants some version of this feature if we can
convince ourselves that it's not going to do horrible things like
cause server crashes, and that even people who don't get their first
choice in terms of how the feature is named or how the GUCs work will
still be pretty happy to have it overall. However, if it breaks stuff
and there's no easy way to fix the breakage, that's going to be a
problem.
In broad strokes, the danger here is that doing stuff in the middle of
query execution that we currently only do at the end of query
execution will turn out to be problematic. The biggest problem, I
think, is whether it's safe to do all of the things that EXPLAIN does
while we're at some random point in query execution. It looks to me
like the wrap/unwrap stuff is more-or-less consistent with previous
discussions of how a feature of this kind should work, though I don't
recall the previous discussion and I think the patch should contain
some comments about why it works the way that it works. I do notice
that WrapExecProcNodeWithExplain does not walk the ps->initPlan list,
which I think is an oversight.
Without having the prior discussion near to hand, I *think* that the
reason we wanted to do this wrap/unwrap stuff is to make it so that
the progressive EXPLAIN code could only execute when entering a new
plan node rather than at any random point inside of that plan node,
and that does seem a lot safer than the alternative. For example, I
think it means that we won't start trying to do progressive EXPLAIN
while already holding some random LWLock, which is very good. However,
this still means we could do a bunch of catalog access (and thus
potentially receive invalidations) at places where that can't happen
today. I'm not sure that's a problem -- the same thing could happen at
any place in the executor where we evaluate a user-supplied
expression, and there are many such places -- but it's certainly worth
a few senior people thinking real carefully about it and trying to
imagine whether there's any scenario in which it might break something
that works today.
One way in which this proposal seems safer than previous proposals is
that previous proposals have involved session A poking session B and
trying to get session B to emit an EXPLAIN on the fly with no prior
setup. That would be very useful, but I think it's more difficult and
more risky than this proposal, where all the configuration happens in
the session that is going to emit the EXPLAIN output. It knows from
the beginning that it's going to maybe be doing this, and so it can do
whatever setup it likes to accomplish that goal. So I think this
design looks pretty good from that point of view.
I don't understand how this would be safe against interrupts or
errors. If a running query is interrupted, what would cause
ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
why isn't that unsafe?
ExecProcNodeOriginal looks like it's basically the same thing as
ExecProcNodeReal, except that it's for a different purpose. But you
would be hard-pressed to guess which one is used for what purpose
based on the field names or the comments. Maybe it's also worth
worrying about whether this is a scalable design. Can we find a way to
use the existing fields here instead of having to add a new one?
The documentation for the progressive_explain = { off | explain |
analyze } option seems like it should go into more detail about how
the "explain" and "analyze" values are different. I'm not 100% sure I
know the answer, and I'm not the least-experienced person who will
ever read this documentation.
WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
invites confusion with MultiExecProcNode, to which it is unrelated.
I just went to some trouble to start breaking up the monolith that is
explain.c, so I'm not that happy about seeing this patch dump another
800+ lines of source code into that file. Probably we should have a
new source file for some of this, or maybe even more than one.
The changes to explain.h add three new data types. Two of them begin
with an uppercase letter and one with a lowercase letter. That seems
inconsistent. I also don't think that the declaration of char plan[]
is per PostgreSQL coding style. I believe we always write char
plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
other than plan, since it's really a string-ified explain-y kind of
thing, not literally a Plan. Also, can we please not have structure
members with single letter names? "h" and "p" are going to be
completely ungreppable, and I like grepping.
It looks very strange to me that ProgressiveExplainPrint() seems to
have a theoretical possibility of generating the output and then
throwing it away if we end up with entry == NULL. I guess maybe that
case is not supposed to happen because ProgressiveExplainInit() is
supposed to create the entry, but then why isn't this an elog(ERROR)
or something instead of a no-op?
It seems like when we replace a longer entry with a shorter one, we
forget that it was originally longer. Hence, if the length of a
progressive EXPLAIN is alternately 2922 characters and 2923
characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.
I'll try to look at this some more tomorrow. It seems like a very
interesting patch, but time is very short for this release and it
doesn't look to me like we have all the kinks sorted out here just
yet.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
Thanks for sparing part of your precious time to look at the patch.
I acknowledge it is a very complex one. Since you're going to take
another look, providing some preliminary comments related to some
of the implementation concerns.
I don't understand how this would be safe against interrupts or
errors. If a running query is interrupted, what would cause
ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
why isn't that unsafe?
The strategy I used here is to use a MemoryContextCallback
(ProgressiveExplainReleaseFunc), configured in the memory context
where the query is being executed, being responsible for calling
ProgressiveExplainCleanup() if the query doesn't end gracefully.
It looks very strange to me that ProgressiveExplainPrint() seems to
have a theoretical possibility of generating the output and then
throwing it away if we end up with entry == NULL. I guess maybe that
case is not supposed to happen because ProgressiveExplainInit() is
supposed to create the entry, but then why isn't this an elog(ERROR)
or something instead of a no-op?
Agreed. Will fix this.
It seems like when we replace a longer entry with a shorter one, we
forget that it was originally longer. Hence, if the length of a
progressive EXPLAIN is alternately 2922 characters and 2923
characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.
Are you talking about re-printing the plan in the same query execution?
The logic for the code, using your example, would be to allocate 2922 +
PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
alternate between 2922 and 2923 no additional allocation will be done.
A reallocation will be needed only if the plan length ends up exceeding
2922+4096. At the end of query execution (or cancellation) that DSA will
be freed and a next query execution will have to allocate again using the
same logic.
Regarding the execProcNode wrapper strategy: I used it precisely because
of the discussion in that other patch. I actually tried not using it here,
calling ProgressiveExplainPrint() in the timeout callback instead. This
resulted in sporadic crashes, confirming the suspicion that it is not a
good idea.
Regarding all other comments related to variable/function names and having
the feature in a separate file, agree with all the comments. Will send a
new version with the fixes.
Rafael.
On Wed, Mar 19, 2025 at 6:53 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
The strategy I used here is to use a MemoryContextCallback
(ProgressiveExplainReleaseFunc), configured in the memory context
where the query is being executed, being responsible for calling
ProgressiveExplainCleanup() if the query doesn't end gracefully.
Thanks for the pointer. I'm a bit skeptical about what's going on here
in ProgressiveExplainReleaseFunc(). It seems like we're depending on
shared memory to tell us whether we need to do purely backend-local
cleanup, like calling disable_timeout() and resetting
ProgressiveExplainPending and activeQueryDesc. I would have expected
us to keep track in local memory of whether this kind of work needs to
be done. It seems roundabout to take an LWLock, do a hash table lookup
to see if there's an entry there, release the LWLock, and then very
shortly thereafter take the lock a second time to release the entry
that we now know is there.
The comment in ProgressiveExplainCleanup about only detaching from the
DSA if the query ended gracefully is not ideal from my point of view
because it says what the code does instead of why the code does that
thing. Also, the function is seemingly called with queryDesc as an
argument not because you need it for anything but because you're going
to test whether it's null. In that case, you could just pass a
Boolean. Even then, something seems odd about this: why do we have to
be holding ProgressiveExplainHashLock to dsa_detach the
somewhat-inscrutably named area pe_a? And why are we not detaching it
in case of error?
I am wondering why you chose this relatively unusual error cleanup
strategy. What I would have done is invent AtEOXact_ProgressiveExplain
and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
because it doesn't need separate handling for transactions and
subtransactions, but it's so unlike what we do in most places that
it's hard for me to tell whether it's correct. I feel like the
approach you've chosen here would make sense if what we wanted to do
was basically release memory or some memory-like resource associated
closely with the context -- e.g. expandedrecord.c releases a
TupleDesc, but this is doing more than that.
I think the effect of this choice is that cleanup of the
progressive-EXPLAIN stuff happens much later than it normally would.
Most of the time, in the case of an abort, we would want
AbortTransaction() to release as many resources as possible, leaving
basically no work to do at CleanupTransaction() time. This is so that
if a user types "SELECT 1/0;" we release resources, as far as
possible, right away, and don't wait for them to subsequently type
"ROLLBACK;". The transaction lives on only as a shell. But these
resources, if I'm reading this right, are going to stick around until
the transaction is actually rolled back, because memory is not freed
until CleanupTransaction() time. I wonder what happens if a query
inside of an explicit transaction aborts after putting an entry in the
progressive-explain view. My guess is that the entry will stick around
until the actual rollback happens.
In fact, now that I think about it, this is probably why we don't
dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
resource owner cleanup will have already released the DSA segments
long before the memory context is deleted.
I'm sort of inclined to say that this should be rewritten to do error
cleanup in a more normal way. It's probably more code to do it that
way, but I think having it be more similar to other subsystems is
probably worth quite a bit.
It seems like when we replace a longer entry with a shorter one, we
forget that it was originally longer. Hence, if the length of a
progressive EXPLAIN is alternately 2922 characters and 2923
characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.
Are you talking about re-printing the plan in the same query execution?
Yes.
The logic for the code, using your example, would be to allocate 2922 +
PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
alternate between 2922 and 2923 no additional allocation will be done.
A reallocation will be needed only if the plan length ends up exceeding
2922+4096. At the end of query execution (or cancellation) that DSA will
be freed and a next query execution will have to allocate again using the
same logic.
It seems to me that if ProgressiveExplainPrint() reaches /* Update
shared memory with new data */ without reallocating,
strlen(pe_data->plan) can be reduced. On the next trip through the
function, we don't know whether the string we're seeing is the
original string -- for which strlen()+PROGRESSIVE_EXPLAIN_FREE_SIZE)
gives us the original allocation size -- or whether the string we're
seeing is a shorter one that was copied over the original, longer
string. PROGRESSIVE_EXPLAIN_FREE_SIZE is big enough that this probably
isn't much of a problem in practice, because consecutive EXPLAIN
outputs for the same query probably won't vary in size by enough to
cause any reallocation. Looking at this more carefully, I think that
the query plan would have to shrink in size by >4kB and then expand
again in order to trigger reallocation, which seems unlikely. But it
still seems unclean to me. Normally, we track how much space we have
actually allocated explicitly, instead of reconstructing that number
from something else, especially something that isn't guaranteed to
produce an accurate result. I think you should just store the number
of available payload bytes at the beginning of the chunk and then
reallocate if it isn't big enough to hold the payload that you have
got.
Regarding the execProcNode wrapper strategy: I used it precisely because
of the discussion in that other patch. I actually tried not using it here,
and call ProgressiveExplainPrint() in the timeout callback. This resulted
in sporadic crashes, confirming the suspicion that it is not a good
idea.
Makes sense, but we need adequate documentation of what we did and why
it works (or at least why we think it works).
Another thing I just noticed is that pg_stat_progress_explain() uses
beentry->st_procpid == ped->pid as the permissions check, but a more
typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
that's only in pgstatfuncs.c, but I think it would be OK to duplicate
the associated test here (i.e. has_privs_of_role(GetUserId(),
ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
don't really see a reason why this shouldn't use the same permission
rules as other pg_stat_ things, in particular pg_stat_get_activity().
--
Robert Haas
EDB: http://www.enterprisedb.com