Generating code for query jumbling through gen_node_support.pl

Started by Michael Paquierover 3 years ago44 messageshackers

michael@paquier.xyz

over 3 years ago

Hi all,

This thread is a follow-up of the recent discussion about query
jumbling with DDL statements, where the conclusion was that we'd want
to generate all this code automatically for all the nodes:
/messages/by-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com

What this patch allows to do it to compute the same query ID for
utility statements using their parsed Node state instead of their
string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
treated the same, for example. But the main idea is not only that.

I have implemented that as of the attached, where the following things
are done:
- queryjumble.c is moved to src/backend/nodes/, to stick with the
other things for node equal/read/write/copy, renamed to
jumblefuncs.c.
- gen_node_support.c is extended to generate the functions and the
switch for the jumbling. There are a few exceptions, as of the Lists
and RangeTblEntry to do the jumbling consistently.
- Two pg_node_attr() are added in consistency with the existing ones:
no_jumble to discard completely a node from the the query jumbling
and jumble_ignore to discard one field from the jumble.

The patch is in a rather good shape, passes check-world and the CI,
but there are a few things that need to be discussed IMO. Things
could be perhaps divided in more patches, now the areas touched are
quite different so it did not look like a big deal to me as the
changes touch different areas and are straight-forward.

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not. There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

Note that the plan is to extend the normalization to some other parts
of the Nodes, like CALL and SET as mentioned on the other thread. I
have done nothing about that yet but doing so can be done in a few
lines with the facility presented here (aka just add a location
field). Hence, the normalization is consistent with the existing
queryjumble.c for the fields and the nodes processed.

In this patch, things are done so as the query ID is not computed
anymore from the query string but from the Query. I still need to
study the performance impact of that with short queries. If things
prove to be noticeable in some cases, this stuff could be switched to
use a new GUC where we could have a code path for the computation of
utilityStmt using its string as a fallback. I am not sure that I want
to enter in this level of complications, though, to keep things
simple, but that's yet to be done.

A bit more could be cut but pg_ident got in the way.. There are also
a few things for pg_stat_statements where a query ID of 0 can be
implied for utility statements in some cases.

Generating this code leads to an overall removal of code as what
queryjumble.c is generated automatically:
13 files changed, 901 insertions(+), 1113 deletions(-)

I am adding that to the next commit fest.

Thoughts?
--
Michael

vignesh C

vignesh21@gmail.com

over 3 years ago

In reply to: Michael Paquier (#1)

Re: Generating code for query jumbling through gen_node_support.pl

On Wed, 7 Dec 2022 at 13:27, Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

This thread is a follow-up of the recent discussion about query
jumbling with DDL statements, where the conclusion was that we'd want
to generate all this code automatically for all the nodes:
/messages/by-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com

What this patch allows to do it to compute the same query ID for
utility statements using their parsed Node state instead of their
string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
treated the same, for example. But the main idea is not only that.

I have implemented that as of the attached, where the following things
are done:
- queryjumble.c is moved to src/backend/nodes/, to stick with the
other things for node equal/read/write/copy, renamed to
jumblefuncs.c.
- gen_node_support.c is extended to generate the functions and the
switch for the jumbling. There are a few exceptions, as of the Lists
and RangeTblEntry to do the jumbling consistently.
- Two pg_node_attr() are added in consistency with the existing ones:
no_jumble to discard completely a node from the the query jumbling
and jumble_ignore to discard one field from the jumble.

The patch is in a rather good shape, passes check-world and the CI,
but there are a few things that need to be discussed IMO. Things
could be perhaps divided in more patches, now the areas touched are
quite different so it did not look like a big deal to me as the
changes touch different areas and are straight-forward.

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not. There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

Note that the plan is to extend the normalization to some other parts
of the Nodes, like CALL and SET as mentioned on the other thread. I
have done nothing about that yet but doing so can be done in a few
lines with the facility presented here (aka just add a location
field). Hence, the normalization is consistent with the existing
queryjumble.c for the fields and the nodes processed.

In this patch, things are done so as the query ID is not computed
anymore from the query string but from the Query. I still need to
study the performance impact of that with short queries. If things
prove to be noticeable in some cases, this stuff could be switched to
use a new GUC where we could have a code path for the computation of
utilityStmt using its string as a fallback. I am not sure that I want
to enter in this level of complications, though, to keep things
simple, but that's yet to be done.

A bit more could be cut but pg_ident got in the way.. There are also
a few things for pg_stat_statements where a query ID of 0 can be
implied for utility statements in some cases.

Generating this code leads to an overall removal of code as what
queryjumble.c is generated automatically:
13 files changed, 901 insertions(+), 1113 deletions(-)

I am adding that to the next commit fest.

The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4047.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
b82557ecc2ebbf649142740a1c5ce8d19089f620 ===
=== applying patch
./0001-Support-for-automated-query-jumble-with-all-Nodes.patch
...
patching file src/backend/utils/misc/queryjumble.c
Hunk #1 FAILED at 1.
Not deleting file src/backend/utils/misc/queryjumble.c as content
differs from patch
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/utils/misc/queryjumble.c.rej

[1]: http://cfbot.cputube.org/patch_41_4047.log

Regards,
Vignesh

Peter Eisentraut

peter_e@gmx.net

over 3 years ago

In reply to: Michael Paquier (#1)

Re: Generating code for query jumbling through gen_node_support.pl

On 07.12.22 08:56, Michael Paquier wrote:

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not.

The generation script already has a way to identify location fields, by
($t eq 'int' && $f =~ 'location$'), so you could use that as well.

There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

I'm concerned about the large number of additional field annotations
this adds. We have been careful so far to document the use of each
attribute, e.g., *why* does a field not need to be copied etc. This
patch adds dozens and dozens of annotations without any explanation at
all. Now, the code this replaces also has no documentation, but maybe
this is the time to add some.

Generating code for query jumbling through gen_node_support.pl

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: