Generating code for query jumbling through gen_node_support.pl

Started by Michael Paquierover 3 years ago44 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,

This thread is a follow-up of the recent discussion about query
jumbling with DDL statements, where the conclusion was that we'd want
to generate all this code automatically for all the nodes:
/messages/by-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com

What this patch allows to do it to compute the same query ID for
utility statements using their parsed Node state instead of their
string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
treated the same, for example. But the main idea is not only that.

I have implemented that as of the attached, where the following things
are done:
- queryjumble.c is moved to src/backend/nodes/, to stick with the
other things for node equal/read/write/copy, renamed to
jumblefuncs.c.
- gen_node_support.c is extended to generate the functions and the
switch for the jumbling. There are a few exceptions, as of the Lists
and RangeTblEntry to do the jumbling consistently.
- Two pg_node_attr() are added in consistency with the existing ones:
no_jumble to discard completely a node from the the query jumbling
and jumble_ignore to discard one field from the jumble.

The patch is in a rather good shape, passes check-world and the CI,
but there are a few things that need to be discussed IMO. Things
could be perhaps divided in more patches, now the areas touched are
quite different so it did not look like a big deal to me as the
changes touch different areas and are straight-forward.

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not. There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

Note that the plan is to extend the normalization to some other parts
of the Nodes, like CALL and SET as mentioned on the other thread. I
have done nothing about that yet but doing so can be done in a few
lines with the facility presented here (aka just add a location
field). Hence, the normalization is consistent with the existing
queryjumble.c for the fields and the nodes processed.

In this patch, things are done so as the query ID is not computed
anymore from the query string but from the Query. I still need to
study the performance impact of that with short queries. If things
prove to be noticeable in some cases, this stuff could be switched to
use a new GUC where we could have a code path for the computation of
utilityStmt using its string as a fallback. I am not sure that I want
to enter in this level of complications, though, to keep things
simple, but that's yet to be done.

A bit more could be cut but pg_ident got in the way.. There are also
a few things for pg_stat_statements where a query ID of 0 can be
implied for utility statements in some cases.

Generating this code leads to an overall removal of code as what
queryjumble.c is generated automatically:
13 files changed, 901 insertions(+), 1113 deletions(-)

I am adding that to the next commit fest.

Thoughts?
--
Michael

Attachments:

0001-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+901-1114
#2vignesh C
vignesh21@gmail.com
In reply to: Michael Paquier (#1)
Re: Generating code for query jumbling through gen_node_support.pl

On Wed, 7 Dec 2022 at 13:27, Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

This thread is a follow-up of the recent discussion about query
jumbling with DDL statements, where the conclusion was that we'd want
to generate all this code automatically for all the nodes:
/messages/by-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com

What this patch allows to do it to compute the same query ID for
utility statements using their parsed Node state instead of their
string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
treated the same, for example. But the main idea is not only that.

I have implemented that as of the attached, where the following things
are done:
- queryjumble.c is moved to src/backend/nodes/, to stick with the
other things for node equal/read/write/copy, renamed to
jumblefuncs.c.
- gen_node_support.c is extended to generate the functions and the
switch for the jumbling. There are a few exceptions, as of the Lists
and RangeTblEntry to do the jumbling consistently.
- Two pg_node_attr() are added in consistency with the existing ones:
no_jumble to discard completely a node from the the query jumbling
and jumble_ignore to discard one field from the jumble.

The patch is in a rather good shape, passes check-world and the CI,
but there are a few things that need to be discussed IMO. Things
could be perhaps divided in more patches, now the areas touched are
quite different so it did not look like a big deal to me as the
changes touch different areas and are straight-forward.

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not. There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

Note that the plan is to extend the normalization to some other parts
of the Nodes, like CALL and SET as mentioned on the other thread. I
have done nothing about that yet but doing so can be done in a few
lines with the facility presented here (aka just add a location
field). Hence, the normalization is consistent with the existing
queryjumble.c for the fields and the nodes processed.

In this patch, things are done so as the query ID is not computed
anymore from the query string but from the Query. I still need to
study the performance impact of that with short queries. If things
prove to be noticeable in some cases, this stuff could be switched to
use a new GUC where we could have a code path for the computation of
utilityStmt using its string as a fallback. I am not sure that I want
to enter in this level of complications, though, to keep things
simple, but that's yet to be done.

A bit more could be cut but pg_ident got in the way.. There are also
a few things for pg_stat_statements where a query ID of 0 can be
implied for utility statements in some cases.

Generating this code leads to an overall removal of code as what
queryjumble.c is generated automatically:
13 files changed, 901 insertions(+), 1113 deletions(-)

I am adding that to the next commit fest.

The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4047.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
b82557ecc2ebbf649142740a1c5ce8d19089f620 ===
=== applying patch
./0001-Support-for-automated-query-jumble-with-all-Nodes.patch
...
patching file src/backend/utils/misc/queryjumble.c
Hunk #1 FAILED at 1.
Not deleting file src/backend/utils/misc/queryjumble.c as content
differs from patch
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/utils/misc/queryjumble.c.rej

[1]: http://cfbot.cputube.org/patch_41_4047.log

Regards,
Vignesh

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#1)
Re: Generating code for query jumbling through gen_node_support.pl

On 07.12.22 08:56, Michael Paquier wrote:

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not.

The generation script already has a way to identify location fields, by
($t eq 'int' && $f =~ 'location$'), so you could use that as well.

There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

I'm concerned about the large number of additional field annotations
this adds. We have been careful so far to document the use of each
attribute, e.g., *why* does a field not need to be copied etc. This
patch adds dozens and dozens of annotations without any explanation at
all. Now, the code this replaces also has no documentation, but maybe
this is the time to add some.

#4Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#3)
Re: Generating code for query jumbling through gen_node_support.pl

On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote:

The generation script already has a way to identify location fields, by ($t
eq 'int' && $f =~ 'location$'), so you could use that as well.

I recall that some of the nodes may need renames to map with this
choice. That could be just one patch on top of the actual feature.

I'm concerned about the large number of additional field annotations this
adds. We have been careful so far to document the use of each attribute,
e.g., *why* does a field not need to be copied etc. This patch adds dozens
and dozens of annotations without any explanation at all. Now, the code
this replaces also has no documentation, but maybe this is the time to add
some.

In most cases, the addition of the node marker would be enough to
self-explain why they are included, but there is a trend for a lot of
the nodes when it comes to collations and typmods where we don't want
to see these in the jumbling calculations. I'll look at providing
more info for all that. (Note: I'm out for now.)
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#3)
Re: Generating code for query jumbling through gen_node_support.pl

On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote:

On 07.12.22 08:56, Michael Paquier wrote:

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not.

The generation script already has a way to identify location fields, by ($t
eq 'int' && $f =~ 'location$'), so you could use that as well.

I did not recall exactly everything here, but there are two parts to
the logic:
- gen_node_support.pl uses exactly this condition when scanning the
nodes to put the correct macro to mark a location to track, calling
down RecordConstLocation().
- Marking a bunch of nodes as jumble_ignore is actually necessary, or
we may finish by silencing parts of queries that should be
semantically unrelevant to the queries jumbled (ColumnRef is one).
Using a "jumble_ignore" flag is equally invasive to using an
"jumble_include" flag for each field, I am afraid, as the number of
fields in the nodes included in jumbles is pretty equivalent to the
number of fields ignored. I tend to prefer the approach of ignoring
things though, which is more consistent with the practive for node
read, write and copy.

Anyway, when it comes to the location, another thing that can be
considered here would be to require a node-level flag for the nodes on
which we want to track the location.  This overlaps a bit with the
fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
most of the code changes like this one as at the end we only care
about the location for Const nodes:
-   int         location;       /* token location, or -1 if unknown */
+   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
+                                                        * if unknown */

I have taken this approach in v2 of the patch, shaving ~100 lines of
more code as there is no need to mark all these location fields with
"jumble_ignore" anymore, except for Const, of course.

There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

I'm concerned about the large number of additional field annotations this
adds. We have been careful so far to document the use of each attribute,
e.g., *why* does a field not need to be copied etc. This patch adds dozens
and dozens of annotations without any explanation at all. Now, the code
this replaces also has no documentation, but maybe this is the time to add
some.

That's fair, though it is not doing to buy us much to update all the
nodes with similar small comments, as well. As far as I know, there
are basiscally three things here: typmods, collation information, and
internal data of the nodes stored during parse-analyze. I have added
more documentation to track what looks like the most relevant areas.

I have begun running some performance tests with this stuff and HEAD
to see if this leads to any difference in the query ID compilation
(compute_query_id = on, on scissors roughly) with a simple set of
short commands (like BEGIN/COMMIT) or longer ones, and I am seeing a
speedup trend actually (?). I still need to think more about a set of
tests here, but I think that micro-benchmarking of JumbleQuery() is
the most adapted approach to minimize the noise, with a few nodes of
various sizes (Const, Query, ColumnRef, anything..).

Thoughts?
--
Michael

Attachments:

v2-0001-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+843-1051
#6Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#5)
Re: Generating code for query jumbling through gen_node_support.pl

On 13.01.23 08:54, Michael Paquier wrote:

Using a "jumble_ignore" flag is equally invasive to using an
"jumble_include" flag for each field, I am afraid, as the number of
fields in the nodes included in jumbles is pretty equivalent to the
number of fields ignored. I tend to prefer the approach of ignoring
things though, which is more consistent with the practive for node
read, write and copy.

I concur that jumble_ignore is better. I suppose you placed the
jumble_ignore markers to maintain parity with the existing code, but I
think that some the markers are actually wrong and are just errors of
omission in the existing code (such as Query.override, to pick a random
example). The way you have structured this would allow us to find and
analyze such errors better.

Anyway, when it comes to the location, another thing that can be
considered here would be to require a node-level flag for the nodes on
which we want to track the location.  This overlaps a bit with the
fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
most of the code changes like this one as at the end we only care
about the location for Const nodes:
-   int         location;       /* token location, or -1 if unknown */
+   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
+                                                        * if unknown */

I have taken this approach in v2 of the patch, shaving ~100 lines of
more code as there is no need to mark all these location fields with
"jumble_ignore" anymore, except for Const, of course.

I don't understand why you chose that when we already have an
established way. This would just make the jumble annotations
inconsistent with the other annotations.

I also have two suggestions to make this patch more palatable:

1. Make a separate patch to reformat long comments, like
835d476fd21bcfb60b055941dee8c3d9559af14c.

2. Make a separate patch to move the jumble support to
src/backend/nodes/jumblefuncs.c. (It was probably a mistake that it
wasn't there to begin with, and some of the errors of omission alluded
to above are probably caused by it having been hidden away in the wrong
place.)

#7Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#6)
Re: Generating code for query jumbling through gen_node_support.pl

On Mon, Jan 16, 2023 at 03:13:35PM +0100, Peter Eisentraut wrote:

On 13.01.23 08:54, Michael Paquier wrote:

Using a "jumble_ignore" flag is equally invasive to using an
"jumble_include" flag for each field, I am afraid, as the number of
fields in the nodes included in jumbles is pretty equivalent to the
number of fields ignored. I tend to prefer the approach of ignoring
things though, which is more consistent with the practive for node
read, write and copy.

I concur that jumble_ignore is better. I suppose you placed the
jumble_ignore markers to maintain parity with the existing code, but I think
that some the markers are actually wrong and are just errors of omission in
the existing code (such as Query.override, to pick a random example). The
way you have structured this would allow us to find and analyze such errors
better.

Thanks. Yes, I have made an effort of going down to maintain an exact
compatibility with the existing code for now. My take is that
removing or adding more things into the jumble deserves its own
discussion. I think that's a bit better once this code is automated
with the nodes, now it would not be difficult either to adjust HEAD
depending on that. Only the analysis effort is different.

Anyway, when it comes to the location, another thing that can be
considered here would be to require a node-level flag for the nodes on
which we want to track the location.  This overlaps a bit with the
fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
most of the code changes like this one as at the end we only care
about the location for Const nodes:
-   int         location;       /* token location, or -1 if unknown */
+   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
+                                                        * if unknown */

I have taken this approach in v2 of the patch, shaving ~100 lines of
more code as there is no need to mark all these location fields with
"jumble_ignore" anymore, except for Const, of course.

I don't understand why you chose that when we already have an established
way. This would just make the jumble annotations inconsistent with the
other annotations.

Because we don't want to track the location of all the nodes! If we
do that, pg_stat_statements would begin to parameterize a lot more
things, from column aliases to full contents of IN or WITH clauses.
The root point is that we only want to track the jumble location for
Const nodes now. In order to do that, there are two approaches:
- Mark all the locations with jumble_ignore: more invasive, but
it requires only one per-field attribute with "jumble_ignore". This
is what v1 did.
- Mark only the fields where we want to track the location with a
second node type, like "jumble_location". We could restrict that
depending on the field type or its name on top of checking the field
attribute, whatever. This is what v2 did.

Which approach do you prefer? Marking all the locations we don't want
with jumble_ignore, or introduce a second attribute with
jumble_location. I'd tend to prefer the approach of minimizing the
number of node and field attributes, FWIW. Now you have worked on
this area previously, so your view may be more adapted than my
thinking process

The long-term perspective is that I'd like to extend the tracking of
the locations to more DDL nodes, like parameters of SET statements or
parts of CALL statements. Not to mention that this makes the work of
forks easier. This is a separate discussion.

I also have two suggestions to make this patch more palatable:

1. Make a separate patch to reformat long comments, like
835d476fd21bcfb60b055941dee8c3d9559af14c.

2. Make a separate patch to move the jumble support to
src/backend/nodes/jumblefuncs.c. (It was probably a mistake that it wasn't
there to begin with, and some of the errors of omission alluded to above are
probably caused by it having been hidden away in the wrong place.)

Both suggestions make sense. I'll shape the next series of the patch
to do something among these lines.

My question about the location tracking greatly influences the first
patch (comment reformating) and third patch (automated generation of
the code) of the series, so do you have a preference about that?
--
Michael

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#7)
Re: Generating code for query jumbling through gen_node_support.pl

On 17.01.23 04:48, Michael Paquier wrote:

Anyway, when it comes to the location, another thing that can be
considered here would be to require a node-level flag for the nodes on
which we want to track the location.  This overlaps a bit with the
fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
most of the code changes like this one as at the end we only care
about the location for Const nodes:
-   int         location;       /* token location, or -1 if unknown */
+   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
+                                                        * if unknown */

I have taken this approach in v2 of the patch, shaving ~100 lines of
more code as there is no need to mark all these location fields with
"jumble_ignore" anymore, except for Const, of course.

I don't understand why you chose that when we already have an established
way. This would just make the jumble annotations inconsistent with the
other annotations.

Because we don't want to track the location of all the nodes! If we
do that, pg_stat_statements would begin to parameterize a lot more
things, from column aliases to full contents of IN or WITH clauses.
The root point is that we only want to track the jumble location for
Const nodes now. In order to do that, there are two approaches:
- Mark all the locations with jumble_ignore: more invasive, but
it requires only one per-field attribute with "jumble_ignore". This
is what v1 did.

Ok, I understand now, and I agree with this approach over the opposite.
I was confused because the snippet you showed above used
"jumble_ignore", but your patch is correct as it uses "jumble_location".

That said, the term "jumble" is really weird, because in the sense that
we are using it here it means, approximately, "to mix together", "to
unify". So what we are doing with the Const nodes is really to *not*
jumble the location, but for all other node types we are jumbling the
location. At least that is my understanding.

#9Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#8)
Re: Generating code for query jumbling through gen_node_support.pl

On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:

Ok, I understand now, and I agree with this approach over the opposite. I
was confused because the snippet you showed above used "jumble_ignore", but
your patch is correct as it uses "jumble_location".

Okay. I'll refresh the patch set so as we have only "jumble_ignore",
then, like v1, with preparatory patches for what you mentioned and
anything that comes into mind.

That said, the term "jumble" is really weird, because in the sense that we
are using it here it means, approximately, "to mix together", "to unify".
So what we are doing with the Const nodes is really to *not* jumble the
location, but for all other node types we are jumbling the location. At
least that is my understanding.

I am quite familiar with this term, FWIW. That's what we've inherited
from the days where this has been introduced in pg_stat_statements.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#9)
Re: Generating code for query jumbling through gen_node_support.pl

On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote:

On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:

Ok, I understand now, and I agree with this approach over the opposite. I
was confused because the snippet you showed above used "jumble_ignore", but
your patch is correct as it uses "jumble_location".

Okay. I'll refresh the patch set so as we have only "jumble_ignore",
then, like v1, with preparatory patches for what you mentioned and
anything that comes into mind.

This is done as of the patch series v3 attached:
- 0001 reformats all the comments of the nodes.
- 0002 moves the current files for query jumble as of queryjumble.c ->
queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h.
- 0003 is the core feature, where I have done a second pass over the
nodes to make sure that things map with HEAD, incorporating the extra
docs coming from v2, adding a bit more.

That said, the term "jumble" is really weird, because in the sense that we
are using it here it means, approximately, "to mix together", "to unify".
So what we are doing with the Const nodes is really to *not* jumble the
location, but for all other node types we are jumbling the location. At
least that is my understanding.

I am quite familiar with this term, FWIW. That's what we've inherited
from the days where this has been introduced in pg_stat_statements.

I have renamed the node attributes to query_jumble_ignore and
no_query_jumble at the end, after considering Peter's point that only
"jumble" could be fuzzy here. The file names are changed in
consequence.

While doing all that, I have done some micro-benchmarking of
JumbleQuery(), making it loop 50M times on my laptop each time a query
ID is computed (hideous hack with a loop in queryjumble.c):
- For non-utility queries, aka queries that go through
JumbleQueryInternal(), I am measuring a repeatable ~10% improvement
with the generated code over HEAD, which is kind of nice. I have
tested a few DMLs and simple SELECTs, still it looks like a trend.
- For utility queries, the new code is competing against
hash_any_extended() with the query string, which is going to be hard
to beat.. I am measuring what looks like a 5 times slowdown, at
least, and more depending on the depth of the query tree. That's not
surprising. On a 50M loop, this comes down to compare a computation
of 100ns to 5ns for a 20-time slowdown for example, still this could
justify the addition of a GUC to control whether utility queries have
their query ID compiled depending on their nodes or their string
hash, as this could become noticeable in OLTP workloads with loads
of short statements and BEGIN/COMMIT queries?

Thoughts or comments?
--
Michael

Attachments:

v3-0001-Rework-format-of-comments-for-nodes.patchtext/x-diff; charset=us-asciiDownload+487-247
v3-0002-Move-query-jumble-code-to-src-backend-nodes.patchtext/x-diff; charset=us-asciiDownload+11-14
v3-0003-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+561-924
#11Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#10)
Re: Generating code for query jumbling through gen_node_support.pl

On 18.01.23 08:04, Michael Paquier wrote:

On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote:

On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:

Ok, I understand now, and I agree with this approach over the opposite. I
was confused because the snippet you showed above used "jumble_ignore", but
your patch is correct as it uses "jumble_location".

Okay. I'll refresh the patch set so as we have only "jumble_ignore",
then, like v1, with preparatory patches for what you mentioned and
anything that comes into mind.

This is done as of the patch series v3 attached:
- 0001 reformats all the comments of the nodes.
- 0002 moves the current files for query jumble as of queryjumble.c ->
queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h.
- 0003 is the core feature, where I have done a second pass over the
nodes to make sure that things map with HEAD, incorporating the extra
docs coming from v2, adding a bit more.

This patch structure looks good.

That said, the term "jumble" is really weird, because in the sense that we
are using it here it means, approximately, "to mix together", "to unify".
So what we are doing with the Const nodes is really to *not* jumble the
location, but for all other node types we are jumbling the location. At
least that is my understanding.

I am quite familiar with this term, FWIW. That's what we've inherited
from the days where this has been introduced in pg_stat_statements.

I have renamed the node attributes to query_jumble_ignore and
no_query_jumble at the end, after considering Peter's point that only
"jumble" could be fuzzy here. The file names are changed in
consequence.

I see that in the 0003 patch, most location fields now have an explicit
markup with query_jumble_ignore. I thought we had previously resolved
to consider location fields to be automatically ignored unless
explicitly included (like for the Const node). This appears to invert
that? Am I missing something?

#12Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#11)
Re: Generating code for query jumbling through gen_node_support.pl

On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:

I see that in the 0003 patch, most location fields now have an explicit
markup with query_jumble_ignore. I thought we had previously resolved to
consider location fields to be automatically ignored unless explicitly
included (like for the Const node). This appears to invert that? Am I
missing something?

My misunderstanding then, I thought that you were OK with what was
part of v1, where all these fields was marked as "ignore". But you
actually prefer v2, with the second field "location" on top of
"ignore". I can update 0003 to refresh that.

Would you be OK if I apply 0001 (with the comments of the locations
still reshaped to ease future property additions) and 0002?
--
Michael

#13Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#11)
Re: Generating code for query jumbling through gen_node_support.pl

On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:

I see that in the 0003 patch, most location fields now have an explicit
markup with query_jumble_ignore. I thought we had previously resolved to
consider location fields to be automatically ignored unless explicitly
included (like for the Const node). This appears to invert that? Am I
missing something?

As a result, I have rebased the patch set to use the two-attribute
approach: query_jumble_ignore and query_jumble_location.

On top of the three previous patches, I am adding 0004 to implement a
GUC able to switch the computation of the utility statements between
what I am calling "string" to compute the query IDs based on the hash
of the query string and the previous default, or "jumble", to use the
parsed tree, with a few more tests to see the difference. Perhaps it
is not worth bothering, but it could be possible that some users don't
want to pay the penalty of doing the query jumbling with the parsed
tree for utilities, as well..
--
Michael

Attachments:

v4-0001-Rework-format-of-comments-for-nodes.patchtext/x-diff; charset=us-asciiDownload+487-247
v4-0002-Move-query-jumble-code-to-src-backend-nodes.patchtext/x-diff; charset=us-asciiDownload+11-14
v4-0003-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+509-859
v4-0004-Add-GUC-utility_query_id.patchtext/x-diff; charset=us-asciiDownload+151-23
#14Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#13)
Re: Generating code for query jumbling through gen_node_support.pl

On 20.01.23 05:35, Michael Paquier wrote:

On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:

I see that in the 0003 patch, most location fields now have an explicit
markup with query_jumble_ignore. I thought we had previously resolved to
consider location fields to be automatically ignored unless explicitly
included (like for the Const node). This appears to invert that? Am I
missing something?

As a result, I have rebased the patch set to use the two-attribute
approach: query_jumble_ignore and query_jumble_location.

Structurally, this looks okay to me now.

In your 0001 patch, most of the comment reformattings for location
fields are no longer needed, so you should undo those.

The 0002 patch looks good.

Those two could be committed with those adjustments, I think.

I'll read the 0003 again more carefully. I haven't studied the new 0004
yet.

#15Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#14)
Re: Generating code for query jumbling through gen_node_support.pl

On Fri, Jan 20, 2023 at 11:56:00AM +0100, Peter Eisentraut wrote:

In your 0001 patch, most of the comment reformattings for location fields
are no longer needed, so you should undo those.

The 0002 patch looks good.

Okay, I have gone through these two again and applied what I had.
0001 has been cleaned up of the extra comment moves for the
locations. Now, I have kept a few changes for some of the nodes to
have some consistency with the other fields, in the case where most of
the fields at the end of the structures have to be marked with new
node attributes. This made the style of the header a bit more
elegant, IMV.

I'll read the 0003 again more carefully. I haven't studied the new 0004
yet.

Thanks, again. Rebased version attached.
--
Michael

Attachments:

v5-0003-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+509-859
v5-0004-Add-GUC-utility_query_id.patchtext/x-diff; charset=us-asciiDownload+151-23
#16Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#15)
Re: Generating code for query jumbling through gen_node_support.pl

On 21.01.23 04:35, Michael Paquier wrote:

I'll read the 0003 again more carefully. I haven't studied the new 0004
yet.

Thanks, again. Rebased version attached.

A couple of small fixes are attached.

There is something weird in _jumbleNode(). There are two switch
(nodeTag(expr)) statements. Maybe that's intentional, but then it
should be commented better, because now it looks more like an editing
mistake.

The handling of T_RangeTblEntry could be improved. In other contexts we
have for example a custom_copy attribute, which generates the switch
entry but not the function. Something like this could be useful here too.

Otherwise, this looks ok. I haven't checked whether it maintains the
exact behavior from before. What is the test coverage situation for this?

For the 0004 patch, it should be documented why one would want one
behavior or the other. That's totally unclear right now.

I think if we are going to accept 0004, then it might be better to
combine it with 0003. Otherwise, 0004 is just undoing a lot of the code
structure changes in JumbleQuery() that 0003 did.

Attachments:

0001-fixup-Support-for-automated-query-jumble-with-all-No.patchtext/plain; charset=UTF-8; name=0001-fixup-Support-for-automated-query-jumble-with-all-No.patchDownload+3-5
#17Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#16)
Re: Generating code for query jumbling through gen_node_support.pl

On Mon, Jan 23, 2023 at 02:27:13PM +0100, Peter Eisentraut wrote:

A couple of small fixes are attached.

Thanks.

There is something weird in _jumbleNode(). There are two switch
(nodeTag(expr)) statements. Maybe that's intentional, but then it should be
commented better, because now it looks more like an editing mistake.

This one is intentional, so as it is possible to track correctly the
highest param ID found while browsing the nodes. IMO it would be
confusing to add that into gen_node_support.pl. Another thing that
could be done is to switch Param to have a custom implementation, like
RangeTblEntry, though this removes the automation around the creation
of _jumbleParam(). I have clarified the comments around that.

The handling of T_RangeTblEntry could be improved. In other contexts we
have for example a custom_copy attribute, which generates the switch entry
but not the function. Something like this could be useful here too.

Hmm. Okay. Fine by me.

Otherwise, this looks ok. I haven't checked whether it maintains the exact
behavior from before. What is the test coverage situation for this?

0003 taken in isolation has some minimal coverage through
pg_stat_statements, though it turns around 15% with compute_query_id =
auto that would enforce the jumbling path only when pg_stat_statements
uses it. Still, my plan here is to enforce the loading of
pg_stat_statements with compute_query_id = regress and
utility_query_id = jumble (if needed) in a new buildfarm machine,
because that's the cheapest path. An extra possibility is to have
pg_regress kicked in a new TAP test with these settings, but that's
costly and we have already two of these :/ Another possibility is to
plug in that into 027_stream_regress or the pg_upgrade test suite with
new settings :/

Anyway, the regression tests of pg_stat_statements should be extended
a bit to cover more node types by default (Say COPY with DMLs for the
InsertStmt & co) to look at how these are showing up once normalized
using their parsed query, and we don't do much around that now.
Normalizing more DDLs should use this path, as well.

For the 0004 patch, it should be documented why one would want one behavior
or the other. That's totally unclear right now.

I am not 100% sure whether we should have this part at the end, but as
an exit path in case somebody complains about the extra load with the
automated jumbling compared to the hash of the query strings, that can
be used as a backup. Anyway, attached is what would be a
clarification.

I think if we are going to accept 0004, then it might be better to combine
it with 0003. Otherwise, 0004 is just undoing a lot of the code structure
changes in JumbleQuery() that 0003 did.

Makes sense. That would be my intention if 0004 is the most
acceptable and splitting things makes things a bit easier to review.
--
Michael

Attachments:

v6-0003-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+519-860
v6-0004-Add-GUC-utility_query_id.patchtext/x-diff; charset=us-asciiDownload+164-23
#18Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#17)
Re: Generating code for query jumbling through gen_node_support.pl

On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:

Makes sense. That would be my intention if 0004 is the most
acceptable and splitting things makes things a bit easier to review.

There was a silly mistake in 0004 where the jumbling code relied on
compute_query_id rather than utility_query_id, so fixed and rebased as
of v7 attached.
--
Michael

Attachments:

v7-0003-Support-for-automated-query-jumble-with-all-Nodes.patchtext/x-diff; charset=us-asciiDownload+519-860
v7-0004-Add-GUC-utility_query_id.patchtext/x-diff; charset=us-asciiDownload+164-23
#19Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#17)
Re: Generating code for query jumbling through gen_node_support.pl

On 24.01.23 07:57, Michael Paquier wrote:

For the 0004 patch, it should be documented why one would want one behavior
or the other. That's totally unclear right now.

I am not 100% sure whether we should have this part at the end, but as
an exit path in case somebody complains about the extra load with the
automated jumbling compared to the hash of the query strings, that can
be used as a backup. Anyway, attached is what would be a
clarification.

Ok, the documentation make sense now. I wonder what the performance
impact is. Probably, nobody cares about microoptimizing CREATE TABLE
statements. But BEGIN/COMMIT could matter. However, whatever you do in
between the BEGIN and COMMIT will already be jumbled, so you're already
paying the overhead. Hopefully, jumbling such simple commands will have
no noticeable overhead.

In other words, we should test this and hopefully get rid of the
'string' method.

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#18)
Re: Generating code for query jumbling through gen_node_support.pl

On 25.01.23 01:08, Michael Paquier wrote:

On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:

Makes sense. That would be my intention if 0004 is the most
acceptable and splitting things makes things a bit easier to review.

There was a silly mistake in 0004 where the jumbling code relied on
compute_query_id rather than utility_query_id, so fixed and rebased as
of v7 attached.

Overall, this looks good to me.

There are a couple of repetitive comments, like "typmod and collation
information are irrelevant for the query jumbling". This applies to all
nodes, so we don't need to repeat it for a number of nodes (and then not
mention it for other nodes). Maybe there should be a central place
somewhere that describes "these kinds of fields should normally be ignored".

#21Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#19)
#22Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#20)
#23Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#17)
#24Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#21)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#22)
#26Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#21)
#27Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#25)
#28Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#26)
#29Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#28)
#30Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#28)
#32Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#33)
#35Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#32)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#35)
#37Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#38)
#40Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#26)
#41Andrei Lepikhov
lepihov@gmail.com
In reply to: Michael Paquier (#40)
#42Michael Paquier
michael@paquier.xyz
In reply to: Andrei Lepikhov (#41)
#43Andrei Lepikhov
lepihov@gmail.com
In reply to: Michael Paquier (#42)
#44Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#40)