Set query_id for query contained in utility statement

Started by Anthonin Bonnefoyover 1 year ago37 messageshackers

anthonin.bonnefoy@datadoghq.com

over 1 year ago

Hi all,

Some utility statements like Explain, CreateTableAs and DeclareCursor
contain a query which will be planned and executed. During post parse,
only the top utility statement is jumbled, leaving the contained query
without a query_id set. Explain does the query jumble in ExplainQuery
but for the contained query but CreateTableAs and DeclareCursor were
left with unset query_id.

This leads to extensions relying on query_id like pg_stat_statements
to not be able to track those nested queries as the query_id was 0.

This patch fixes this by recursively jumbling queries contained in
those utility statements during post_parse, setting the query_id for
those contained queries and removing the need for ExplainQuery to do
it for the Explain statements.

Regards,
Anthonin Bonnefoy

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#1)

Re: Set query_id for query contained in utility statement

I've realised my initial approach was wrong, calling the post parse
for all nested queries in analyze.c prevents extension like pgss to
correctly track the query's nesting level.

I've changed the approach to replicate what's done in ExplainQuery to
both CreateTableAs and DeclareCursor: Jumble the query contained by
the utility statement and call the post parse hook before it is
planned or executed. Additionally, explain's nested query can itself
be a CreateTableAs or DeclareCursor which also needs to be jumbled.
The issue is visible when explaining a CreateTableAs or DeclareCursor
Query, the queryId is missing despite the verbose.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=4)
Output: 1

Post patch, the query id is correctly displayed.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=4)
Output: 1
Query Identifier: 2800308901962295548

Regards,
Anthonin Bonnefoy

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#2)

Re: Set query_id for query contained in utility statement

On Mon, Aug 5, 2024 at 3:19 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

I've realised my initial approach was wrong, calling the post parse
for all nested queries in analyze.c prevents extension like pgss to
correctly track the query's nesting level.

I've changed the approach to replicate what's done in ExplainQuery to
both CreateTableAs and DeclareCursor: Jumble the query contained by
the utility statement and call the post parse hook before it is
planned or executed. Additionally, explain's nested query can itself
be a CreateTableAs or DeclareCursor which also needs to be jumbled.
The issue is visible when explaining a CreateTableAs or DeclareCursor
Query, the queryId is missing despite the verbose.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=4)
Output: 1

Post patch, the query id is correctly displayed.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=4)
Output: 1
Query Identifier: 2800308901962295548

The output is what we expect.

now after applying your patch.
SELECT pg_stat_statements_reset();
EXPLAIN (verbose) create table test_t as select 1;
EXPLAIN (verbose) create table test_t as select 2;
SELECT queryid, left(query, 60),plans, calls, rows FROM
pg_stat_statements ORDER BY calls DESC LIMIT 5;

the output is:
queryid | left
| plans | calls | rows
----------------------+--------------------------------------------------------------+-------+-------+------
2800308901962295548 | EXPLAIN (verbose) create table test_t as
select 1; | 2 | 2 | 0
2093602470903273926 | EXPLAIN (verbose) create table test_t as
select $1 | 0 | 2 | 0
-2694081619397734273 | SELECT pg_stat_statements_reset()
| 0 | 1 | 1
2643570658797872815 | SELECT queryid, left(query, $1),plans, calls,
rows FROM pg_s | 1 | 0 | 0

"EXPLAIN (verbose) create table test_t as select 1;" called twice,
is that what we expect?

should first row, the normalized query becomes

EXPLAIN (verbose) create table test_t as select $1;

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#3)

Re: Set query_id for query contained in utility statement

On Mon, Aug 26, 2024 at 5:26 AM jian he <jian.universality@gmail.com> wrote:

queryid | left
| plans | calls | rows
----------------------+--------------------------------------------------------------+-------+-------+------
2800308901962295548 | EXPLAIN (verbose) create table test_t as
select 1; | 2 | 2 | 0
2093602470903273926 | EXPLAIN (verbose) create table test_t as
select $1 | 0 | 2 | 0

"EXPLAIN (verbose) create table test_t as select 1;" called twice,
is that what we expect?

pg_stat_statements reports nested queries and toplevel queries
separately. You can check with the toplevel column.
2800308901962295548 is the nested ctas part while 2093602470903273926
is the top explain utility statement (which explain why there's 0
plans since it's an utility statement). Since the explain ctas query
was called twice, it will be reported as 2 toplevel queries and 2
nested queries.

should first row, the normalized query becomes
EXPLAIN (verbose) create table test_t as select $1;

Good point, the issue in this case was the nested query was stored by
pg_stat_statements during the ExecutorEnd hook. This hook doesn't have
the jstate and parseState at that point so pg_stat_statements can't
normalize the query.

I've modified the patch to fix this. Instead of just jumbling the
query in ExplainQuery, I've moved jumbling in ExplainOneUtility which
already has specific code to handle ctas and dcs. Calling the post
parse hook here allows pg_stat_statements to store the normalized
version of the query for this queryid and nesting level.

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#4)

Re: Set query_id for query contained in utility statement

On Mon, Aug 26, 2024 at 4:55 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

/* Evaluate parameters, if any */
if (entry->plansource->num_params)
{
- ParseState *pstate;
-
- pstate = make_parsestate(NULL);
- pstate->p_sourcetext = queryString;

you deleted the above these lines, but passed (ParseState *pstate) in
ExplainExecuteQuery
how do you make sure ExplainExecuteQuery passed (ParseState *pstate)
the p_next_resno is 1 and p_resolve_unknowns is true.
maybe we can add some Asserts like in ExplainExecuteQuery

/* Evaluate parameters, if any */
if (entry->plansource->num_params)
{
Assert(pstate->p_next_resno == 1);
Assert(pstate->p_resolve_unknowns == 1);
}

also it's ok to use passed (ParseState *pstate) for
{
estate = CreateExecutorState();
estate->es_param_list_info = params;
paramLI = EvaluateParams(pstate, entry, execstmt->params, estate);
}
?
I really don't know.

some of the change is refactoring, maybe you can put it into a separate patch.

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#5)

Re: Set query_id for query contained in utility statement

On Tue, Aug 27, 2024 at 11:14 AM jian he <jian.universality@gmail.com> wrote:

also it's ok to use passed (ParseState *pstate) for
{
estate = CreateExecutorState();
estate->es_param_list_info = params;
paramLI = EvaluateParams(pstate, entry, execstmt->params, estate);
}
?
I really don't know.

some of the change is refactoring, maybe you can put it into a separate patch.

Thanks for the review. I think the parser state is mostly used for the
error callbacks and parser_errposition but I'm not 100% sure. Either
way, you're right and it probably shouldn't be in the patch. I've
modified the patch to restrict the changes to only add the necessary
query jumble and post parse hook calls.

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#6)

Re: Set query_id for query contained in utility statement

PREPARE test_prepare_pgss1(int, int) AS select generate_series($1, $2);
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
EXECUTE test_prepare_pgss1 (1,2);
EXECUTE test_prepare_pgss1 (1,3);
SELECT toplevel, calls, query, queryid, rows FROM pg_stat_statements
ORDER BY query COLLATE "C", toplevel;
SELECT pg_stat_statements_reset() IS NOT NULL AS t;

---the above works just fine. just for demo purpose

explain(verbose) EXECUTE test_prepare_pgss1(1, 2);
explain(verbose) EXECUTE test_prepare_pgss1(1, 3);

SELECT toplevel, calls, query, queryid, rows FROM pg_stat_statements
ORDER BY query COLLATE "C", toplevel;
toplevel | calls | query
| queryid | rows
----------+-------+------------------------------------------------------------------------+----------------------+------
f | 2 | PREPARE test_prepare_pgss1(int, int) AS select
generate_series($1, $2) | -3421048434214482065 | 0
t | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
| 3366652201587963567 | 1
t | 0 | SELECT toplevel, calls, query, queryid, rows FROM
pg_stat_statements +| -6410939316132384446 | 0
| | ORDER BY query COLLATE "C", toplevel
| |
t | 1 | explain(verbose) EXECUTE test_prepare_pgss1(1, 2)
| 7618807962395633001 | 0
t | 1 | explain(verbose) EXECUTE test_prepare_pgss1(1, 3)
| -2281958002956676857 | 0

Is it possible to normalize top level utilities explain query, make
these two have the same queryid?
explain(verbose) EXECUTE test_prepare_pgss1(1, 2);
explain(verbose) EXECUTE test_prepare_pgss1(1, 3);

I guess this is a corner case.

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#7)

Re: Set query_id for query contained in utility statement

On Mon, Sep 30, 2024 at 5:14 PM jian he <jian.universality@gmail.com> wrote:

Is it possible to normalize top level utilities explain query, make
these two have the same queryid?
explain(verbose) EXECUTE test_prepare_pgss1(1, 2);
explain(verbose) EXECUTE test_prepare_pgss1(1, 3);

I guess this is a corner case.

This seems to be a known issue. The test_prepare_pgss1's parameters
are A_Const nodes. Those nodes have a custom query jumble which
doesn't record location[1]https://github.com/postgres/postgres/blob/cf4401fe6cf56811343edcad29c96086c2c66481/src/backend/nodes/queryjumblefuncs.c#L323-L355 and thus can't be normalised by pgss.

That could be fixed by replacing the switch on nodeTag by
JUMBLE_LOCATION(location) but this will impact a lot of DDL queries
and the result doesn't look great (for example, "BEGIN TRANSACTION NOT
DEFERRABLE, READ ONLY, READ WRITE, DEFERRABLE" would be normalised as
"BEGIN TRANSACTION $1 DEFERRABLE, $2 ONLY, $3 WRITE, $4")

Looking at the commit for the A_Const's jumble, this is mentioned by Michael[2]/messages/by-id/Y9+HuYslMAP6yyPb@paquier.xyz:

(FWIW, I'd like to think that there is an argument to normalize the
A_Const nodes for a portion of the DDL queries, by ignoring their
values in the query jumbling and mark a location, which would be
really useful for some workloads, but that's a separate discussion I
am keeping for later.)

I haven't found any recent discussion but this should live in a
different thread as this is a separate issue.

[1]: https://github.com/postgres/postgres/blob/cf4401fe6cf56811343edcad29c96086c2c66481/src/backend/nodes/queryjumblefuncs.c#L323-L355
[2]: /messages/by-id/Y9+HuYslMAP6yyPb@paquier.xyz

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Anthonin Bonnefoy (#8)

Re: Set query_id for query contained in utility statement

On Tue, Oct 01, 2024 at 09:26:40AM +0200, Anthonin Bonnefoy wrote:

This seems to be a known issue. The test_prepare_pgss1's parameters
are A_Const nodes. Those nodes have a custom query jumble which
doesn't record location[1] and thus can't be normalised by pgss.

That could be fixed by replacing the switch on nodeTag by
JUMBLE_LOCATION(location) but this will impact a lot of DDL queries
and the result doesn't look great (for example, "BEGIN TRANSACTION NOT
DEFERRABLE, READ ONLY, READ WRITE, DEFERRABLE" would be normalised as
"BEGIN TRANSACTION $1 DEFERRABLE, $2 ONLY, $3 WRITE, $4")

Yeah, I've made peace with myself regarding the fact that adding a
location tracker to A_Const is just a very bad idea and that we should
never do that, ever. So what I proposed on this thread is a no-go. I
am not saying that there is no solution, just that this solution is a
bad one.

You can see the extent of the damages this would cause with the diffs
generated in pg_stat_statements. If there is a gap in the tests for
commands using A_Const nodes compared to what's on HEAD, we should
expand that to properly track how normalization would apply to each
one of them (see the recent one for SET queries, for example). This
is really useful when manipulating the parse nodes and query jumbling
with some pg_node_attr().
--
Michael

#10

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Anthonin Bonnefoy (#6)

Re: Set query_id for query contained in utility statement

On Fri, Aug 30, 2024 at 09:37:03AM +0200, Anthonin Bonnefoy wrote:

Thanks for the review. I think the parser state is mostly used for the
error callbacks and parser_errposition but I'm not 100% sure. Either
way, you're right and it probably shouldn't be in the patch. I've
modified the patch to restrict the changes to only add the necessary
query jumble and post parse hook calls.

So this adds four calls post_parse_analyze_hook, leading to more data
added to pgss for non-toplevel queries: one in createas.c for the CTAS
internal query, one in portalcmds.c for the inner query of DECLARE,
and two for utilities in EXPLAIN.

This is a rather old problem, trying to bring more consistency across
the board, and it comes down to this bit with EXPLAIN, that can be
seen on HEAD:
SET pg_stat_statements.track = 'all';
explain (costs off) select 1;

FWIW, I've always found this case with EXPLAIN with two entries
confusing, so what's the advantage in trying to apply this rule for
the rest? We know that EXPLAIN, DECLARE and CTAS run a query attached
to their DDL, hence isn't it sufficient to register a single entry for
the top-level query, then nothing for the internal one. The
documentation tells about inner queries when pg_stat_statements.track
= all, like the ones in PL functions, DO blocks, because we have a
clear view of the query string, creating a unique one-one mapping
between a query string and its ID. This patch maps the same query
string to more than one query ID, spreading that.

So it seems that there are arguments for not doing what this patch
proposes, but also make sure that EXPLAIN logs a single entry, not
two currently when using pg_stat_statements.track = all.

Side note. It looks like the patch is forgetting about CREATE VIEW
and CREATE MATERIALIZED VIEW, creating only a top-level entry when
running these utilities.
--
Michael

#11

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: Michael Paquier (#10)

Re: Set query_id for query contained in utility statement

On Wed, Oct 2, 2024 at 6:39 AM Michael Paquier <michael@paquier.xyz> wrote:

FWIW, I've always found this case with EXPLAIN with two entries
confusing, so what's the advantage in trying to apply this rule for
the rest? We know that EXPLAIN, DECLARE and CTAS run a query attached
to their DDL, hence isn't it sufficient to register a single entry for
the top-level query, then nothing for the internal one. The
documentation tells about inner queries when pg_stat_statements.track
= all, like the ones in PL functions, DO blocks, because we have a
clear view of the query string, creating a unique one-one mapping
between a query string and its ID. This patch maps the same query
string to more than one query ID, spreading that.

So it seems that there are arguments for not doing what this patch
proposes, but also make sure that EXPLAIN logs a single entry, not
two currently when using pg_stat_statements.track = all.

I agree that tracking 2 identical statements with different queryIds
and nesting levels is very confusing. On the other hand, from an
extension developer point of view (not necessarily limited to
pg_stat_statements), I would like to have the queryId available and
the post_parse hook called so the query can be normalised and tracked
in a hashmap.

However, the repeated statements did bug me a lot so I took a stab at
trying to find a possible solution. I made an attempt in 0001 by
tracking the statements' locations of explainable statements (Select,
Insert, Update, Merge, Delete...) during parse and propagate them in
the generated Query during transform. With the change, we now have the
following result:

SET pg_stat_statements.track = 'all';
explain (costs off) select 1;
select 1;
select queryid, calls, query, toplevel from pg_stat_statements
where query ~'select \$1';
queryid | calls | query | toplevel
---------------------+-------+-------------------------------+----------
2800308901962295548 | 1 | select $1 | t
2800308901962295548 | 1 | select $1; | f
3797185112479763582 | 1 | explain (costs off) select $1 | t

The top level and nested select statement now share both the same
queryId and query string. The additional ';' for the nested query is
due to not having the statement length and taking the whole
statement.

Side note. It looks like the patch is forgetting about CREATE VIEW
and CREATE MATERIALIZED VIEW, creating only a top-level entry when
running these utilities.

I've updated 0002 to handle CREATE MATERIALIZED VIEW, CREATE VIEW
doesn't generate nested statements. I've also realised that refresh
materialized view has a similar problem to explain. The first refresh
called when the matview is created will have the correct select
substring. Subsequent refresh call will use the refresh utility
statement as the query string:

-- Create the view
CREATE MATERIALIZED VIEW test AS SELECT * FROM generate_series(1, 1000) as id;
-- Reset pgss and refresh
select * from pg_stat_statements_reset();
REFRESH MATERIALIZED VIEW test;
select queryid, calls, query, toplevel from pg_stat_statements;
queryid | calls | query
| toplevel
----------------------+-------+------------------------------------------+----------
8227191975572355654 | 1 | REFRESH MATERIALIZED VIEW test | t
-2908407163726309935 | 1 | REFRESH MATERIALIZED VIEW test; | f
-1361326859153559975 | 1 | select * from pg_stat_statements_reset() | t

I've tried to improve this behaviour in 0003 where the view's
definition is used as query string instead of the refresh utility
statement.

#12

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#11)

Re: Set query_id for query contained in utility statement

On Fri, Oct 4, 2024 at 5:05 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

I agree that tracking 2 identical statements with different queryIds
and nesting levels is very confusing. On the other hand, from an
extension developer point of view (not necessarily limited to
pg_stat_statements), I would like to have the queryId available and
the post_parse hook called so the query can be normalised and tracked
in a hashmap.

However, the repeated statements did bug me a lot so I took a stab at
trying to find a possible solution. I made an attempt in 0001 by
tracking the statements' locations of explainable statements (Select,
Insert, Update, Merge, Delete...) during parse and propagate them in
the generated Query during transform. With the change, we now have the
following result:

SET pg_stat_statements.track = 'all';
explain (costs off) select 1;
select 1;
select queryid, calls, query, toplevel from pg_stat_statements
where query ~'select \$1';
queryid | calls | query | toplevel
---------------------+-------+-------------------------------+----------
2800308901962295548 | 1 | select $1 | t
2800308901962295548 | 1 | select $1; | f
3797185112479763582 | 1 | explain (costs off) select $1 | t

The top level and nested select statement now share both the same
queryId and query string. The additional ';' for the nested query is
due to not having the statement length and taking the whole
statement.

about v5 0001
select_with_parens:
'(' select_no_parens ')' { $$ = $2; }
| '(' select_with_parens ')' { $$ = $2; }
;

query "select $1);" looks weird. not sure how to improve it,
or this should be the expected behavior?

in gram.y
| values_clause { $$ = $1; }
| TABLE relation_expr
for TABLE relation_expr
we can add `n->location = @1;`

for values_clause we can do also,
then in transformValuesClause do the same as in transformSelectStmt.

#13

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: jian he (#12)

Re: Set query_id for query contained in utility statement

On Fri, Oct 04, 2024 at 08:16:00PM +0800, jian he wrote:

about v5 0001
select_with_parens:
'(' select_no_parens ')' { $$ = $2; }
| '(' select_with_parens ')' { $$ = $2; }
;

toplevel | calls | query
----------+-------+-------------------------------------------------------
t | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
t | 0 | SELECT toplevel, calls, query FROM pg_stat_statements+
| | ORDER BY query COLLATE "C", toplevel
t | 1 | explain (select $1)
f | 1 | select $1);

query "select $1);" looks weird. not sure how to improve it,
or this should be the expected behavior?

GOod point, this is confusing. The point is that having only
stmt_location is not enough to detect where the element in the query
you want to track is because it only points at its start location in
the full query string. In an ideal world, what we should have is its
start and end, pass it down to pgss_store(), and store only this
subquery between the start and end positions in the stats entry.
Making that right through the parser may be challenging, though.

This concept is something that's perhaps larger than this thread? I
think that we want the same kind of thing for values in IN() and ANY()
clauses, where we want to track an area for a single normalization
parameter, perhaps with a separate node_attr. I am not sure if using
the same trackers would make sense, so I am just waving hands a bit
here, but the concepts required are quite close.

Saying that, a patch set implemented this way would ensure a strict
1:1 mapping between a query ID and the internal query in these EXPLAIN
and CREATE commands, which would be good.

The first step should be IMO to expand the tests of pgss and track all
the behaviors we have historically in the tree about all that. Then,
it becomes much easier to track how much we want to tweak them
depending on if pgss.track is set to "top" or "all", and easier to see
how a behavior changes when manipulating the parse node structures
with location data.
--
Michael

#14

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Michael Paquier (#13)

Re: Set query_id for query contained in utility statement

On Mon, Oct 7, 2024 at 1:39 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Oct 04, 2024 at 08:16:00PM +0800, jian he wrote:

about v5 0001
select_with_parens:
'(' select_no_parens ')' { $$ = $2; }
| '(' select_with_parens ')' { $$ = $2; }
;

toplevel | calls | query
----------+-------+-------------------------------------------------------
t | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
t | 0 | SELECT toplevel, calls, query FROM pg_stat_statements+
| | ORDER BY query COLLATE "C", toplevel
t | 1 | explain (select $1)
f | 1 | select $1);

query "select $1);" looks weird. not sure how to improve it,
or this should be the expected behavior?

GOod point, this is confusing. The point is that having only
stmt_location is not enough to detect where the element in the query
you want to track is because it only points at its start location in
the full query string. In an ideal world, what we should have is its
start and end, pass it down to pgss_store(), and store only this
subquery between the start and end positions in the stats entry.
Making that right through the parser may be challenging, though.

turns out UPDATE/DELETE/MERGE and other utilities stmt cannot have
arbitrary parenthesis with EXPLAIN.

attached patches can solve this specific problem.
(based on v5-0001-Track-location-to-extract-relevant-part-in-nested.patch)

the main gotcha is to add location information for the statement that
is being explained.
typedef struct ExplainStmt
{
NodeTag type;
Node *query; /* the query (see comments above) */
List *options; /* list of DefElem nodes */
ParseLoc location; /* location of the statement being explained */
} ExplainStmt;

explain select 1;
explain (select 1);
explain (((select 1)));

the above 3 explained select queries will be normalized to one select query.

#15

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#14)

Re: Set query_id for query contained in utility statement

On Mon, Oct 7, 2024 at 7:39 AM Michael Paquier <michael@paquier.xyz> wrote:

GOod point, this is confusing. The point is that having only
stmt_location is not enough to detect where the element in the query
you want to track is because it only points at its start location in
the full query string. In an ideal world, what we should have is its
start and end, pass it down to pgss_store(), and store only this
subquery between the start and end positions in the stats entry.
Making that right through the parser may be challenging, though.

One of the issues is that we don't track the length in the parser,
only location[1]https://github.com/postgres/postgres/blob/REL_17_STABLE/src/backend/parser/gram.y#L69-L79. The only place we can have some information about
the statement length (or at least, the location of the ';') is for
multi statement query.

On Mon, Oct 7, 2024 at 6:17 PM jian he <jian.universality@gmail.com> wrote:

turns out UPDATE/DELETE/MERGE and other utilities stmt cannot have
arbitrary parenthesis with EXPLAIN.

Yes, it is also possible to get the length of the Select statement
within parenthesis through the parser by using the location of ')' for
the select_no_parens.

the main gotcha is to add location information for the statement that
is being explained.

I've found that there are other possible issues with not having the
statement length and including the opening parenthesis won't be
enough. On HEAD, we have the following:

explain(verbose) SELECT 1, 2, 3\; explain SELECT 1, 2, 3, 4;
SELECT toplevel, query FROM pg_stat_statements
ORDER BY toplevel desc, query;
toplevel | query
----------+-----------------------------------------------------------------
t | SELECT pg_stat_statements_reset() IS NOT NULL AS t
t | explain SELECT $1, $2, $3, $4
t | explain(verbose) SELECT $1, $2, $3
f | explain(verbose) SELECT $1, $2, $3; explain SELECT 1, 2, 3, 4;
f | explain(verbose) SELECT 1, 2, 3; explain SELECT $1, $2, $3, $4;

The nested statement will have the whole query string. To fix this, we
need to propagate the statement length from the RawStmt (probably
using the ParserState?) and adjust the nested query's location and
length when the statement is transformed. I'm still working on the
details and edge cases on this.

[1]: https://github.com/postgres/postgres/blob/REL_17_STABLE/src/backend/parser/gram.y#L69-L79

#16

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#15)

Re: Set query_id for query contained in utility statement

Here is a new version of the patchset.

0001: Add tests to cover the current behaviour: Missing nested
statements for CreateTableAs, DeclareCursor and MaterializedViews,
nested statements reported by explain including the whole string
(multi statement or whole utility statement). I've tried to be
exhaustive, testing both all and top tracking, but I may have missed
some possible corner cases.

0002: Track the location of explainable statements. We keep RawStmt's
location and length in the ParseState and use it to compute the
transformed statement's length, this is done to handle the
multi-statement query issue. For SelectStmt, we also track the
statement length when select is inside parentheses and use it when
available.
For with clauses, I've switched to directly getting the correct
location from the parser with the 'if (@$ < 0) @$ = @2;' trick. Select
is handled differently and the location is updated in
insertSelectOptions.
Tracking the statement length as the benefit of having consistent
query string between the top and nested statement. A 'Select 1' should
be reported with the same string, without the ';' in both cases.

0003: Add query jumble and post_parse calls CreateTableAs,
DeclareCursor and MaterializedViews so they would report the nested
statements, like what explain is doing.

0004: On a materialized view refresh, fetch the view definition to
report as the executed statement. By default, it would report the
Refresh utility statement while the nested statement would be the
select statement.

0005: Report the nested query for Prepare statement. When executing a
prepared query, the whole Prepare utility statement is reported for
the nested query. This patch reuses the statement's location and
ParseState's length to extract the relevant part of the query string.

Something that's not clear to me, I've added location to SelectStmt,
InsertStmt, DeleteStmt, UpdateStmt and MergeStmt in parsenodes.h.
Should the location be tagged as query_jumble_ignore? Looking at other
nodes, it doesn't seem consistent whether the location has this tag or
not.

#17

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#16)

Re: Set query_id for query contained in utility statement

On Wed, Oct 9, 2024 at 4:49 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

Here is a new version of the patchset.

0001: Add tests to cover the current behaviour: Missing nested
statements for CreateTableAs, DeclareCursor and MaterializedViews,
nested statements reported by explain including the whole string
(multi statement or whole utility statement). I've tried to be
exhaustive, testing both all and top tracking, but I may have missed
some possible corner cases.

0002: Track the location of explainable statements. We keep RawStmt's
location and length in the ParseState and use it to compute the
transformed statement's length, this is done to handle the
multi-statement query issue. For SelectStmt, we also track the
statement length when select is inside parentheses and use it when
available.
For with clauses, I've switched to directly getting the correct
location from the parser with the 'if (@$ < 0) @$ = @2;' trick. Select
is handled differently and the location is updated in
insertSelectOptions.
Tracking the statement length as the benefit of having consistent
query string between the top and nested statement. A 'Select 1' should
be reported with the same string, without the ';' in both cases.

hi.

Query *
transformTopLevelStmt(ParseState *pstate, RawStmt *parseTree)
{
Query *result;
/* We're at top level, so allow SELECT INTO */
result = transformOptionalSelectInto(pstate, parseTree->stmt);
result->stmt_location = parseTree->stmt_location;
result->stmt_len = parseTree->stmt_len;
return result;
}
function call chain:
transformTopLevelStmt transformOptionalSelectInto transformStmt
transformSelectStmt

in transformSelectStmt we do
makeNode(Query), assign Query's stmt_len, stmt_location value.
if in transformSelectStmt we did it wrong, then
transformTopLevelStmt

result->stmt_location = parseTree->stmt_location;
result->stmt_len = parseTree->stmt_len;

will override values we've previously assigned at transformSelectStmt.

this feel not safe?

+qry->stmt_len = pstate->p_stmt_len - (stmt->location -
pstate->p_stmt_location);
i feel like, the comments didn't explain very well the difference between
stmt->location and pstate->p_stmt_location.
i know it's related to multi-statement, possibly through \;

SELECT pg_stat_statements_reset() IS NOT NULL AS t;
explain (values (1));
explain ((values (1)));
explain table tenk1;
explain ((table tenk1));
SELECT toplevel, calls, query FROM pg_stat_statements ORDER BY query
COLLATE "C", toplevel;
final output:

I already mentioned this at the end of [1]/messages/by-id/CACJufxF9hqyfmKEdpiG=PbrGdKVNP2BQjHFJh4q6639sV7NmvQ@mail.gmail.com.
Can you try to also normalize these cases, since we've normalized the
nested select query in explain statement.

[1]: /messages/by-id/CACJufxF9hqyfmKEdpiG=PbrGdKVNP2BQjHFJh4q6639sV7NmvQ@mail.gmail.com

#18

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#17)

Re: Set query_id for query contained in utility statement

Hi,

On Fri, Oct 11, 2024 at 2:39 AM jian he <jian.universality@gmail.com> wrote:

in transformSelectStmt we do
makeNode(Query), assign Query's stmt_len, stmt_location value.
if in transformSelectStmt we did it wrong, then
transformTopLevelStmt

result->stmt_location = parseTree->stmt_location;
result->stmt_len = parseTree->stmt_len;

will override values we've previously assigned at transformSelectStmt.

this feel not safe?

Good point. There are multiple spots in the call tree where the
location/length was set which is definitely confusing. I've updated
the patch to always set the location/length within the transformStmt
calls, near the creation of the query nodes.
This means that transformSelectStmt was only doing a call
transformOptionalSelectInto and was mostly unnecessary. I've replaced
transformSelectStmt calls by direct calls to
transformOptionalSelectInto.

+qry->stmt_len = pstate->p_stmt_len - (stmt->location -
pstate->p_stmt_location);
i feel like, the comments didn't explain very well the difference between
stmt->location and pstate->p_stmt_location.
i know it's related to multi-statement, possibly through \;

I've added more details to the comments.

SELECT pg_stat_statements_reset() IS NOT NULL AS t;
explain (values (1));
explain ((values (1)));
explain table tenk1;
explain ((table tenk1));
SELECT toplevel, calls, query FROM pg_stat_statements ORDER BY query
COLLATE "C", toplevel;
final output:

toplevel | calls | query
----------+-------+--------------------------------------------------------------------------------------------
t | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
t | 0 | SELECT toplevel, calls, query FROM
pg_stat_statements ORDER BY query COLLATE "C", toplevel
t | 2 | explain (values ($1))
f | 2 | explain (values ($1));
f | 2 | explain table tenk1
t | 2 | explain table tenk1

I already mentioned this at the end of [1].
Can you try to also normalize these cases, since we've normalized the
nested select query in explain statement.

I've missed that, thanks for the reminder. Which made me realise that
the TABLE command was also not handled. I've added both TABLE and
VALUES in the tests and they should now report correctly when nested.

#19

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#18)

Re: Set query_id for query contained in utility statement

I've found that COPY wasn't correctly managed. "COPY (SELECT 1) to
stdout;" nested statement would be tracked as "SELECT $1) to stdout".
Since COPY accepts a PreparableStmt, stmt_len for all types of
PreparableStmt needs to be tracked. This is done through
updatePreparableStmtEnd in CopyStmt since we know the location of the
closing parenthesis. The logic to set Query->stmt_len for statements
that have a possible stmt_len has also been moved to a dedicated
setQueryStmtLen function.

I've updated the patchset with additional tests for COPY in 0001. 0002
includes the necessary modifications to handle COPY.

#20

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#19)

Re: Set query_id for query contained in utility statement

hi.

explain(verbose) SELECT 1, 2, 3\; explain SELECT 1, 2, 3, 4;
will transformed to
explain(verbose) SELECT 1, 2, 3; explain SELECT 1, 2, 3, 4;

it seems to me your patch care about following position.
explain(verbose) SELECT 1, 2, 3; explain SELECT 1, 2, 3, 4;
^

but this patch [1]/messages/by-id/2245576.1728418678@sss.pgh.pa.us at another thread will get the top level statement
(passed the raw parse, no syntax error) beginning more effortless.
explain(verbose) SELECT 1, 2, 3; explain SELECT 1, 2, 3, 4;
^ ^

can you try to looking at [1]/messages/by-id/2245576.1728418678@sss.pgh.pa.us. it may help to resolve this patch problem.

[1]: /messages/by-id/2245576.1728418678@sss.pgh.pa.us

#21

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

over 1 year ago

In reply to: jian he (#20)

#22

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Anthonin Bonnefoy (#21)