Regression in statement locations

Started by David Steele8 months ago12 messages
#1David Steele
david@pgbackrest.org

Hackers,

499edb0 introduced more precise locations for nested statements. In
general it works quite well and has made the pgAudit output much more
readable -- so kudos for that.

However, I have noticed one regression in the pgAudit tests. We have
this somewhat odd statement intentionally crafted for catch issues with
logging:

--
-- Test obfuscated dynamic sql for clean logging
DO $$
DECLARE
table_name TEXT = 'do_table';
BEGIN
EXECUTE E'\t\n\r CREATE TABLE ' || table_name || E' ("weird name"
INT)\t\n\r ; DROP table ' || table_name;
END $$;

In PG17 pgAudit logs the expected output:

NOTICE: AUDIT: SESSION,35,2,DDL,CREATE
TABLE,TABLE,public.do_table,"CREATE TABLE do_table (""weird name""
INT)",<none>
NOTICE: AUDIT: SESSION,35,3,DDL,DROP TABLE,TABLE,public.do_table,DROP
table do_table,<none>

But in PG18 we now get:

NOTICE: AUDIT: SESSION,35,2,DDL,CREATE
TABLE,TABLE,public.do_table,"CREATE TABLE do_table (""weird name""
INT)",<none>
NOTICE: AUDIT: SESSION,35,3,DDL,DROP
TABLE,TABLE,public.do_table,"CREATE TABLE do_table (""weird name"" INT)

; DROP table do_table",<none>

The create table statement is fine but the drop table statement has
create table glued to it. I tried removing the extra quotes, whitespace,
etc. from our test and the result is the same.

I had a look through 499edb0 and did not see anything obvious. It is
possible that I am missing a required change on my side but since the
commit did not make any code changes to pg_stat_statements I do not
think so. It is also possible that the regression is not coming from
499edb0 but I do not see another obvious candidate.

Thoughts?

Thanks,
-David

#2Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#1)
Re: Regression in statement locations

On Mon, May 19, 2025 at 02:31:24PM +0000, David Steele wrote:

But in PG18 we now get:

NOTICE: AUDIT: SESSION,35,2,DDL,CREATE TABLE,TABLE,public.do_table,"CREATE
TABLE do_table (""weird name"" INT)",<none>
NOTICE: AUDIT: SESSION,35,3,DDL,DROP TABLE,TABLE,public.do_table,"CREATE
TABLE do_table (""weird name"" INT)

; DROP table do_table",<none>

The create table statement is fine but the drop table statement has create
table glued to it. I tried removing the extra quotes, whitespace, etc. from
our test and the result is the same.

Yes, I've noticed this one while looking at the output generated by
pgaudit through the new logic when I've posted this message:
/messages/by-id/ZzqXU-2uTWzC38jO@paquier.xyz

If I remember what was on my mind back then (don't have this \r\n case
specifically written down on my notes from last November as far as I
can see now), I was thinking that this new output is expected as this
is the full query string for the nested statement passed down to the
EXECUTE clause done in this PL function. Anthonin, what's your take?

I am adding Jian in CC, who also participated on the thread.
--
Michael

#3Sami Imseih
samimseih@gmail.com
In reply to: Michael Paquier (#2)
Re: Regression in statement locations

It is also possible that the regression is not coming from
499edb0 but I do not see another obvious candidate.

I used pg_stat_statements to repro the issue, and a bisect
resulted in 499edb0 being the source of the regression.

```
select pg_stat_statements_reset();

set pg_stat_statements.track='all';
DO $$
DECLARE
BEGIN
EXECUTE 'CREATE TABLE do_table (weird_name INT); DROP table do_table';
END $$;
select query from pg_stat_statements where not toplevel;
```

WITHOUT a semicolon at the end of the statements, as reported by David

```
EXECUTE 'CREATE TABLE do_table (weird_name INT); DROP table do_table';
```

I can reproduce the issue

query
-------------------------------------------------------------
CREATE TABLE do_table (weird_name INT); DROP table do_table
CREATE TABLE do_table (weird_name INT)
(2 rows)

WITH a semicolon at the end of the statements
```
EXECUTE 'CREATE TABLE do_table (weird_name INT); DROP table do_table;';
```

I cannot reproduce the issue

query
----------------------------------------
DROP table do_table
CREATE TABLE do_table (weird_name INT)
(2 rows)

I am still not sure why this is the case, but wanted to share this
for now.

--
Sami Imseih
Amazon Web Services (AWS)

#4Michael Paquier
michael@paquier.xyz
In reply to: Sami Imseih (#3)
Re: Regression in statement locations

On Mon, May 19, 2025 at 05:10:14PM -0500, Sami Imseih wrote:

I am still not sure why this is the case, but wanted to share this
for now.

Hmm. Something seems to not be compiling well for the final query of
a stmtmulti in gram.y with updateRawStmtEnd(), as we rely on the
position of the semicolon to decide what the stmt_len should be with
multiple querues. We don't set the stmt_len without the semicolon,
causing pgss to reuse the full query length when storing the entry of
the last query.

With the semicolon in place, stmt_len gets set for the last query of
the string. Still digging more..
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
1 attachment(s)
Re: Regression in statement locations

On Tue, May 20, 2025 at 08:38:47AM +0900, Michael Paquier wrote:

With the semicolon in place, stmt_len gets set for the last query of
the string. Still digging more..

And got it. The problem is that we are failing to update the
statement location in a couple of cases with subqueries, and that we
should handle (p_stmt_len == 0) as of using the remaining bytes in
the string when a location is available, but the code was too
aggressive in thinking that the length = 0 case should be always
discarded.

Once I have fixed that, I've been a bit puzzled by the difference in
output in the tests of pg_overexplain, but I think that the new output
is actually the correct one: the queries whose plan outputs have
changed are passed as arguments of the explain_filter() function,
hence the location of the inner queries point at the start location of
the inner query instead of the start of the top-level query. Note
that if you add a semicolon at the end of these three queries in the
pg_overexplain tests, we finish with an end location reported.

I have also played with 499edb0 reverted and noted that the results of
pg_overexplain were inconsistent when the module has been originally
introduced, with two queries choking a bit.
--
Michael

Attachments:

0001-Fix-regression-with-statement-location-of-some-subqu.patchtext/x-diff; charset=us-asciiDownload
From 74b484ee4b05b5ff7139bfc3ae1c194060a6dfaf Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 May 2025 12:53:35 +0900
Subject: [PATCH] Fix regression with statement location of some subqueries

Note the changes in the output of pg_overexplain, as the new locations
refer to the beginning of the argument passed to the function
explain_filter().

Blah.
---
 src/backend/parser/analyze.c                  | 20 +++----
 .../expected/pg_overexplain.out               |  8 +--
 .../expected/level_tracking.out               | 57 +++++++++++++++++++
 .../pg_stat_statements/sql/level_tracking.sql | 32 +++++++++++
 4 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 1f4d6adda524..bc524f0754ae 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -260,13 +260,6 @@ setQueryLocationAndLength(ParseState *pstate, Query *qry, Node *parseTree)
 {
 	ParseLoc	stmt_len = 0;
 
-	/*
-	 * If there is no information about the top RawStmt's length, leave it at
-	 * 0 to use the whole string.
-	 */
-	if (pstate->p_stmt_len == 0)
-		return;
-
 	switch (nodeTag(parseTree))
 	{
 		case T_InsertStmt:
@@ -308,16 +301,21 @@ setQueryLocationAndLength(ParseState *pstate, Query *qry, Node *parseTree)
 		/* Statement's length is known, use it */
 		qry->stmt_len = stmt_len;
 	}
-	else
+	else if (pstate->p_stmt_len > 0)
 	{
 		/*
-		 * Compute the statement's length from the statement's location and
-		 * the RawStmt's length and location.
+		 * The top RawStmt's length is known, so calculate the statement's
+		 * length from the statement's location and the RawStmt's length and
+		 * location.
 		 */
 		qry->stmt_len = pstate->p_stmt_len - (qry->stmt_location - pstate->p_stmt_location);
 	}
 
-	/* The calculated statement length should be calculated as positive. */
+	/*
+	 * The calculated statement length should be calculated as positive.  It
+	 * may be possible that stmt_len is 0, in which case the full remaining
+	 * string is used.
+	 */
 	Assert(qry->stmt_len >= 0);
 }
 
diff --git a/contrib/pg_overexplain/expected/pg_overexplain.out b/contrib/pg_overexplain/expected/pg_overexplain.out
index 44120c388af5..cb5c396c5192 100644
--- a/contrib/pg_overexplain/expected/pg_overexplain.out
+++ b/contrib/pg_overexplain/expected/pg_overexplain.out
@@ -119,7 +119,7 @@ $$);
    Subplans Needing Rewind: none
    Relation OIDs: NNN...
    Executor Parameter Types: none
-   Parse Location: 0 to end
+   Parse Location: 41 to end
  RTI 1 (relation, inherited, in-from-clause):
    Eref: vegetables (id, name, genus)
    Relation: vegetables
@@ -240,7 +240,7 @@ $$);
        <Subplans-Needing-Rewind>none</Subplans-Needing-Rewind>      +
        <Relation-OIDs>NNN...</Relation-OIDs>                        +
        <Executor-Parameter-Types>none</Executor-Parameter-Types>    +
-       <Parse-Location>0 to end</Parse-Location>                    +
+       <Parse-Location>53 to end</Parse-Location>                   +
      </PlannedStmt>                                                 +
      <Range-Table>                                                  +
        <Range-Table-Entry>                                          +
@@ -344,7 +344,7 @@ $$);
    Subplans Needing Rewind: none
    Relation OIDs: NNN...
    Executor Parameter Types: none
-   Parse Location: 0 to end
+   Parse Location: 28 to end
 (37 rows)
 
 SET debug_parallel_query = false;
@@ -372,7 +372,7 @@ $$);
    Subplans Needing Rewind: none
    Relation OIDs: NNN...
    Executor Parameter Types: 0
-   Parse Location: 0 to end
+   Parse Location: 28 to end
 (15 rows)
 
 -- Create an index, and then attempt to force a nested loop with inner index
diff --git a/contrib/pg_stat_statements/expected/level_tracking.out b/contrib/pg_stat_statements/expected/level_tracking.out
index 03bea14d5da0..ee6dea3a9cd3 100644
--- a/contrib/pg_stat_statements/expected/level_tracking.out
+++ b/contrib/pg_stat_statements/expected/level_tracking.out
@@ -1319,6 +1319,63 @@ SELECT toplevel, calls, query FROM pg_stat_statements
  t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
 (4 rows)
 
+-- DO block --- multiple inner queries with separators
+SET pg_stat_statements.track = 'all';
+SET pg_stat_statements.track_utility = TRUE;
+CREATE TABLE pgss_do_util_tab_1 (a int);
+CREATE TABLE pgss_do_util_tab_2 (a int);
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DO $$
+DECLARE BEGIN
+  EXECUTE 'CREATE TABLE pgss_do_table (weird_name INT); DROP TABLE pgss_do_table';
+  EXECUTE 'SELECT a FROM pgss_do_util_tab_1; SELECT a FROM pgss_do_util_tab_2';
+END $$;
+DO $$
+DECLARE BEGIN
+END $$;
+SELECT toplevel, calls, rows, query FROM pg_stat_statements
+  WHERE toplevel IS FALSE
+  ORDER BY query COLLATE "C";
+ toplevel | calls | rows |                    query                    
+----------+-------+------+---------------------------------------------
+ f        |     1 |    0 | CREATE TABLE pgss_do_table (weird_name INT)
+ f        |     1 |    0 | DROP TABLE pgss_do_table
+ f        |     1 |    0 | SELECT a FROM pgss_do_util_tab_1
+ f        |     1 |    0 | SELECT a FROM pgss_do_util_tab_2
+(4 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+-- Note the extra semicolon at the end of the query.
+DO $$
+DECLARE BEGIN
+  EXECUTE 'CREATE TABLE pgss_do_table (weird_name INT); DROP TABLE pgss_do_table;';
+  EXECUTE 'SELECT a FROM pgss_do_util_tab_1; SELECT a FROM pgss_do_util_tab_2;';
+END $$;
+DO $$
+DECLARE BEGIN
+END $$;
+SELECT toplevel, calls, rows, query FROM pg_stat_statements
+  WHERE toplevel IS FALSE
+  ORDER BY query COLLATE "C";
+ toplevel | calls | rows |                    query                    
+----------+-------+------+---------------------------------------------
+ f        |     1 |    0 | CREATE TABLE pgss_do_table (weird_name INT)
+ f        |     1 |    0 | DROP TABLE pgss_do_table
+ f        |     1 |    0 | SELECT a FROM pgss_do_util_tab_1
+ f        |     1 |    0 | SELECT a FROM pgss_do_util_tab_2
+(4 rows)
+
+DROP TABLE pgss_do_util_tab_1, pgss_do_util_tab_2;
 -- PL/pgSQL function - top-level tracking.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
diff --git a/contrib/pg_stat_statements/sql/level_tracking.sql b/contrib/pg_stat_statements/sql/level_tracking.sql
index 6b81230f1869..553059816915 100644
--- a/contrib/pg_stat_statements/sql/level_tracking.sql
+++ b/contrib/pg_stat_statements/sql/level_tracking.sql
@@ -334,6 +334,38 @@ END; $$;
 SELECT toplevel, calls, query FROM pg_stat_statements
   ORDER BY query COLLATE "C", toplevel;
 
+-- DO block --- multiple inner queries with separators
+SET pg_stat_statements.track = 'all';
+SET pg_stat_statements.track_utility = TRUE;
+CREATE TABLE pgss_do_util_tab_1 (a int);
+CREATE TABLE pgss_do_util_tab_2 (a int);
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DO $$
+DECLARE BEGIN
+  EXECUTE 'CREATE TABLE pgss_do_table (weird_name INT); DROP TABLE pgss_do_table';
+  EXECUTE 'SELECT a FROM pgss_do_util_tab_1; SELECT a FROM pgss_do_util_tab_2';
+END $$;
+DO $$
+DECLARE BEGIN
+END $$;
+SELECT toplevel, calls, rows, query FROM pg_stat_statements
+  WHERE toplevel IS FALSE
+  ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+-- Note the extra semicolon at the end of the query.
+DO $$
+DECLARE BEGIN
+  EXECUTE 'CREATE TABLE pgss_do_table (weird_name INT); DROP TABLE pgss_do_table;';
+  EXECUTE 'SELECT a FROM pgss_do_util_tab_1; SELECT a FROM pgss_do_util_tab_2;';
+END $$;
+DO $$
+DECLARE BEGIN
+END $$;
+SELECT toplevel, calls, rows, query FROM pg_stat_statements
+  WHERE toplevel IS FALSE
+  ORDER BY query COLLATE "C";
+DROP TABLE pgss_do_util_tab_1, pgss_do_util_tab_2;
+
 -- PL/pgSQL function - top-level tracking.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
-- 
2.49.0

#6jian he
jian.universality@gmail.com
In reply to: Michael Paquier (#5)
Re: Regression in statement locations

On Tue, May 20, 2025 at 11:59 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, May 20, 2025 at 08:38:47AM +0900, Michael Paquier wrote:

With the semicolon in place, stmt_len gets set for the last query of
the string. Still digging more..

And got it. The problem is that we are failing to update the
statement location in a couple of cases with subqueries, and that we
should handle (p_stmt_len == 0) as of using the remaining bytes in
the string when a location is available, but the code was too
aggressive in thinking that the length = 0 case should be always
discarded.

I was thinking of using strlen(pstate->p_sourcetext) - qry->stmt_location;
in setQueryLocationAndLength,
then I saw your changes in the pg_overexplain module.

maybe put some comments on top of setQueryLocationAndLength
for (qry->stmt_len == 0) case, but i saw the comments on the bottom
of setQueryLocationAndLength.

Overall, it looks good to me.

#7Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: jian he (#6)
Re: Regression in statement locations

On Tue, May 20, 2025 at 5:59 AM Michael Paquier <michael@paquier.xyz> wrote:

Once I have fixed that, I've been a bit puzzled by the difference in
output in the tests of pg_overexplain, but I think that the new output
is actually the correct one: the queries whose plan outputs have
changed are passed as arguments of the explain_filter() function,
hence the location of the inner queries point at the start location of
the inner query instead of the start of the top-level query. Note
that if you add a semicolon at the end of these three queries in the
pg_overexplain tests, we finish with an end location reported.

Indeed, by dumping more details on the query string and parsed string
in pg_overexplain, the reported location does match the inner SELECT.
This seems appropriate since it is for the planned select statement.

Executor Parameter Types: none
Query String: EXPLAIN (DEBUG, RANGE_TABLE, COSTS OFF) SELECT
genus, array_agg(name ORDER BY name) FROM vegetables GROUP BY genus
Parse Location: 41 to end
Parsed String: SELECT genus, array_agg(name ORDER BY name) FROM
vegetables GROUP BY genus

Looking at the tests, there are 2 additionals empty DO blocks:
+DO $$
+DECLARE BEGIN
+END $$;

What's the point of those? They won't be visible in the output since
we have 'toplevel IS FALSE' in the pg_stat_statements calls and they
don't fit the "DO block --- multiple inner queries with separators".

Other than that, the patch looks good.

#8Michael Paquier
michael@paquier.xyz
In reply to: Anthonin Bonnefoy (#7)
Re: Regression in statement locations

On Tue, May 20, 2025 at 09:58:04AM +0200, Anthonin Bonnefoy wrote:

Looking at the tests, there are 2 additionals empty DO blocks:
+DO $$
+DECLARE BEGIN
+END $$;

What's the point of those? They won't be visible in the output since
we have 'toplevel IS FALSE' in the pg_stat_statements calls and they
don't fit the "DO block --- multiple inner queries with separators".

That's a copy-pasto. Will remove.

Other than that, the patch looks good.

Thanks for the review, Anthonin and Jian.
--
Michael

#9Sami Imseih
samimseih@gmail.com
In reply to: Michael Paquier (#8)
Re: Regression in statement locations

Tested the patch and it looks good to me.

Not that I thought it would fail, but I also confirmed the pgaudit case
works as expected.

```
LOG: AUDIT: SESSION,10,2,DDL,CREATE TABLE,,,"CREATE TABLE do_table
(""weird name""
INT)",<not logged>
LOG: AUDIT: SESSION,10,3,DDL,DROP TABLE,,,DROP table do_table,<not logged>
DO
```

--
Sami

#10David Steele
david@pgbackrest.org
In reply to: Sami Imseih (#9)
Re: Regression in statement locations

On 5/20/25 07:34, Sami Imseih wrote:

Tested the patch and it looks good to me.

Not that I thought it would fail, but I also confirmed the pgaudit case
works as expected.

I also tested and everything looks good with the patch.

I did a careful examination of the remaining diffs (there are quite a
few) and in every case I consider them to be beneficial, i.e. they make
the output more targeted and readable.

I did not do a real code review, but I did notice that the test table
column is called weird_name as in our tests. I would argue that since it
is missing the quotes and space it is not really all that weird and
should maybe get a normal name so developers in the future don't wonder
what is weird about it.

Thank you for this improvement and the quick fix!

Regards,
-David

#11Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#10)
Re: Regression in statement locations

On Tue, May 20, 2025 at 02:04:20PM +0000, David Steele wrote:

I did a careful examination of the remaining diffs (there are quite a few)
and in every case I consider them to be beneficial, i.e. they make the
output more targeted and readable.

I did not do a real code review, but I did notice that the test table column
is called weird_name as in our tests. I would argue that since it is missing
the quotes and space it is not really all that weird and should maybe get a
normal name so developers in the future don't wonder what is weird about it.

I have fixed that, as it is not a weird attribute, removed the
unnedeed DO blocks in the tests as pointed out by Anthonin, and moved
one comment as pointed out by Jian.

Thank you for this improvement and the quick fix!

Yeah, thanks all for pointing out that sometimes my analysis of things
can go off tracks.

The fix has been applied now on HEAD. I've also checked the state of
pgaudit on branch dev-pg18, with the regression getting fixed. Things
look clear now, at least from my side.
--
Michael

#12David Steele
david@pgbackrest.org
In reply to: Michael Paquier (#11)
Re: Regression in statement locations

On 5/20/25 21:28, Michael Paquier wrote:

The fix has been applied now on HEAD. I've also checked the state of
pgaudit on branch dev-pg18, with the regression getting fixed. Things
look clear now, at least from my side.

Just retested and it looks good from my side, too. Thanks again!

-David