Making auto_explain more useful / convenient

Started by Vladimir Churyukinover 2 years ago5 messageshackers

vladimir@churyukin.com

over 2 years ago

Hello,

I'm not sure if these ideas were circulated before or not.
We use auto_explain a lot to investigate slow/problematic queries.
One of the main issues with its usability comes from the fact that EXPLAIN
output is logged rather than returned to the caller in some way. If you
have a large cluster with lots of replicas, there is also an extra
inconvenience of log accumulation, search, etc.
Why not have an option to return EXPLAIN results as a NoticeResponse
instead? That would make its usage more convenient.

Another thing is tangentially related...
I think it may be good to have a number of options to generate
significantly shorter output similar to EXPLAIN. EXPLAIN is great, but
sometimes people need more concise and specific information, for example
total number of buffers and reads by certain query (this is pretty common),
whether or not we had certain nodes in the plan (seq scan, scan of certain
index(es)), how bad was cardinality misprediction on certain nodes, etc.
It's not totally clear yet what would be the best way to define those
rules, but I think we can come up with something reasonable. Logging or
returning shorter messages like that can cause less overhead than logging
full EXPLAIN and can potentially allow for better query monitoring overall.

Do you see any potential issues with implementing those? Of course there
should be more details, like what kind of configuration parameters to add,
how to define rules for the 2nd case, etc. Just wanted to check if there
are any objections in general.

Thank you,
-Vladimir Churyukin.

Sergei Kornilov

sk@zsrv.org

over 2 years ago

In reply to: Vladimir Churyukin (#1)

Re:Making auto_explain more useful / convenient

Hello

auto_explain.log_level is available since postgresql 12.

postgres=# load 'auto_explain';
LOAD
postgres=# set auto_explain.log_min_duration to 0;
SET
postgres=# set auto_explain.log_level to 'notice';
SET
postgres=# select 1;
NOTICE: duration: 0.010 ms plan:
Query Text: select 1;
Result (cost=0.00..0.01 rows=1 width=4)
?column?
----------
1

regards, Sergei

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Vladimir Churyukin (#1)

Re: Making auto_explain more useful / convenient

Vladimir Churyukin <vladimir@churyukin.com> writes:

Why not have an option to return EXPLAIN results as a NoticeResponse
instead? That would make its usage more convenient.

That seems quite useless to me, and likely actually counterproductive.
If you are manually investigating query performance, you can just use
EXPLAIN directly. The point of auto_explain, ISTM, is to capture info
about queries issued by automated applications. So something like the
above could only work if you taught every one of your applications to
capture the NOTICE output, separate it from random other NOTICE
output, and then (probably) log it somewhere central for later
inspection. That's a lot of code to write, and at the end you'd
only have effectively duplicated existing tooling such as pgbadger.
Also, what happens in applications you forgot to convert?

Another thing is tangentially related...
I think it may be good to have a number of options to generate
significantly shorter output similar to EXPLAIN. EXPLAIN is great, but
sometimes people need more concise and specific information, for example
total number of buffers and reads by certain query (this is pretty common),
whether or not we had certain nodes in the plan (seq scan, scan of certain
index(es)), how bad was cardinality misprediction on certain nodes, etc.

Maybe, but again I'm a bit skeptical. IME you frequently don't know
what you're looking for until you've seen the bigger picture. Zeroing
in on details like this could be pretty misleading.

regards, tom lane

Vladimir Churyukin

vladimir@churyukin.com

over 2 years ago

In reply to: Sergei Kornilov (#2)

Re: Making auto_explain more useful / convenient

Thank you, that answers the first part of my question.

On Sat, Nov 11, 2023 at 2:43 AM Sergei Kornilov <sk@zsrv.org> wrote:

Show quoted text

Hello

auto_explain.log_level is available since postgresql 12.

postgres=# load 'auto_explain';
LOAD
postgres=# set auto_explain.log_min_duration to 0;
SET
postgres=# set auto_explain.log_level to 'notice';
SET
postgres=# select 1;
NOTICE: duration: 0.010 ms plan:
Query Text: select 1;
Result (cost=0.00..0.01 rows=1 width=4)
?column?
----------
1

regards, Sergei

Vladimir Churyukin

vladimir@churyukin.com

over 2 years ago

In reply to: Tom Lane (#3)

Re: Making auto_explain more useful / convenient

On Sat, Nov 11, 2023 at 7:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Vladimir Churyukin <vladimir@churyukin.com> writes:

Why not have an option to return EXPLAIN results as a NoticeResponse
instead? That would make its usage more convenient.

That seems quite useless to me, and likely actually counterproductive.
If you are manually investigating query performance, you can just use
EXPLAIN directly. The point of auto_explain, ISTM, is to capture info
about queries issued by automated applications. So something like the
above could only work if you taught every one of your applications to
capture the NOTICE output, separate it from random other NOTICE
output, and then (probably) log it somewhere central for later
inspection. That's a lot of code to write, and at the end you'd
only have effectively duplicated existing tooling such as pgbadger.
Also, what happens in applications you forgot to convert?

Sergey Kornilov just gave the right answer above in the thread for this one.
Unfortunately, there are a lot of scenarios where you can't use pgbadger or
any other log analysis or it's not convenient.
There are a bunch of cloud hosted forks of postgres for example, not all of
them give you this functionality.
In AWS for example you need to download all the logs first, which
complicates it significantly.
The goal of this is not investigating performance of a single query but
rather constant monitoring of a bunch (or all) queries, so you can detect
plan degradations right away.

Another thing is tangentially related...
I think it may be good to have a number of options to generate
significantly shorter output similar to EXPLAIN. EXPLAIN is great, but
sometimes people need more concise and specific information, for example
total number of buffers and reads by certain query (this is pretty

common),

whether or not we had certain nodes in the plan (seq scan, scan of

certain

index(es)), how bad was cardinality misprediction on certain nodes, etc.

Maybe, but again I'm a bit skeptical. IME you frequently don't know
what you're looking for until you've seen the bigger picture. Zeroing
in on details like this could be pretty misleading.

If you don't know what you're looking for, then it's not very useful, I
agree.
But in many cases you know. There are certain generic "signs of trouble"
that you can detect by
the amount of data the query processor scans, by cache hit rate for certain
queries. presence of seq scans or scans of certain indexes,
large differences between predicted and actual rows, some other stuff that
may be relevant to your app/queries specifically that you want to monitor.
We're already doing similar analysis on our side (a multi-terabyte db
cluster with hundreds of millions to billions queries running daily).
But it's not efficient enough because:
1. the problem I mentioned above, access to logs is limited on cloud
environments
2. explain output could be huge, it causes performance issues because of
its size. compact output is much more preferable for mass processing
(it's even more important if this output is to notice messages rather than
to logs, that's why I said it's tangentially related)

Since it seems the notice output is already possible, half of the problem
is solved already.
I'll try to come up with possible options for more compact output
then, unless you think it's completely futile.

thank you,
-Vladimir Churyukin