pgsql: Introduce squashing of constant lists in query jumbling

Started by Alvaro Herreraabout 1 year ago10 messageshackers
Jump to latest
#1Alvaro Herrera
alvherre@2ndquadrant.com

Introduce squashing of constant lists in query jumbling

pg_stat_statements produces multiple entries for queries like
SELECT something FROM table WHERE col IN (1, 2, 3, ...)

depending on the number of parameters, because every element of
ArrayExpr is individually jumbled. Most of the time that's undesirable,
especially if the list becomes too large.

Fix this by introducing a new GUC query_id_squash_values which modifies
the node jumbling code to only consider the first and last element of a
list of constants, rather than each list element individually. This
affects both the query_id generated by query jumbling, as well as
pg_stat_statements query normalization so that it suppresses printing of
the individual elements of such a list.

The default value is off, meaning the previous behavior is maintained.

Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Sergey Dudoladov (mysterious, off-list)
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Marcos Pegoraro <marcos@f10.com.br>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Tested-by: Yasuo Honda <yasuo.honda@gmail.com>
Tested-by: Sergei Kornilov <sk@zsrv.org>
Tested-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Tested-by: Chengxi Sun <sunchengxi@highgo.com>
Tested-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: /messages/by-id/CA+q6zcWtUbT_Sxj0V6HY6EZ89uv5wuG5aefpe_9n0Jr3VwntFg@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/62d712ecfd940f60e68bde5b6972b6859937c412

Modified Files
--------------
contrib/pg_stat_statements/Makefile | 2 +-
contrib/pg_stat_statements/expected/squashing.out | 464 ++++++++++++++++++++++
contrib/pg_stat_statements/meson.build | 1 +
contrib/pg_stat_statements/pg_stat_statements.c | 76 +++-
contrib/pg_stat_statements/sql/squashing.sql | 180 +++++++++
doc/src/sgml/config.sgml | 30 ++
doc/src/sgml/pgstatstatements.sgml | 24 +-
src/backend/nodes/gen_node_support.pl | 19 +-
src/backend/nodes/queryjumblefuncs.c | 146 ++++++-
src/backend/postmaster/launch_backend.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/primnodes.h | 2 +-
src/include/nodes/queryjumble.h | 7 +
15 files changed, 945 insertions(+), 22 deletions(-)

#2Christoph Berg
myon@debian.org
In reply to: Alvaro Herrera (#1)
Squash constant lists in query jumbling by default

Re: �lvaro Herrera

Introduce squashing of constant lists in query jumbling

pg_stat_statements produces multiple entries for queries like
SELECT something FROM table WHERE col IN (1, 2, 3, ...)

depending on the number of parameters, because every element of
ArrayExpr is individually jumbled. Most of the time that's undesirable,
especially if the list becomes too large.

Fix this by introducing a new GUC query_id_squash_values which modifies
the node jumbling code to only consider the first and last element of a
list of constants, rather than each list element individually. This
affects both the query_id generated by query jumbling, as well as
pg_stat_statements query normalization so that it suppresses printing of
the individual elements of such a list.

The default value is off, meaning the previous behavior is maintained.

The "jumble names of temp tables" thread was briefly touching this [1]/messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com,
I'm starting a new thread since the others are already very long.

[1]: /messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com

Two points were made:

1) this should be on by default
2) there should be no GUC for it.

For 1), Sami said "Why would anyone not want to squash the IN list?"
to which I can only agree. Michael agreed as well, that's already +3.

For 2), Tom said that configurability is 1) often much less useful
than originally planned, and 2) tools have to cope with both settings
anyway, making implementing them harder. Plus, switching at run-time
makes the result even less predictable.

So, I would propose that we drop the GUC and make it the default.
Opinions?

Christoph

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christoph Berg (#2)
Re: Squash constant lists in query jumbling by default

Christoph Berg <myon@debian.org> writes:

For 2), Tom said that configurability is 1) often much less useful
than originally planned, and 2) tools have to cope with both settings
anyway, making implementing them harder. Plus, switching at run-time
makes the result even less predictable.

To clarify that last bit: if some clients run with the GUC on and some
with it off, you have a mess. Even statements that are completely
identical will have different query IDs under the two settings.

If this GUC sticks around, it should be at least PGC_SUSET (on
the analogy of compute_query_id) to make it harder to break
pg_stat_statements that way.

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: Squash constant lists in query jumbling by default

On 2025-Mar-25, Tom Lane wrote:

Christoph Berg <myon@debian.org> writes:

For 2), Tom said that configurability is 1) often much less useful
than originally planned, and 2) tools have to cope with both settings
anyway, making implementing them harder. Plus, switching at run-time
makes the result even less predictable.

To clarify that last bit: if some clients run with the GUC on and some
with it off, you have a mess. Even statements that are completely
identical will have different query IDs under the two settings.

True.

If this GUC sticks around, it should be at least PGC_SUSET (on
the analogy of compute_query_id) to make it harder to break
pg_stat_statements that way.

I have no problem making it superuser-only, and I can see making "on" be
the default. I am not opposed to removing it completely either, if we
really think that the current behavior is no longer useful for anybody.

Earlier in the discussion, other possible values for the option were
suggested, such as a way to distinguish arrays that had "lots" (say,
hundreds or more) of entries from arrays that were "small". That could
be selected by the user (or site admin) using this GUC, though there was
no agreement on exactly what that would be. During the FOSDEM 2024
development meeting there was a general dislike of this idea, which
AFAIR was mostly predicated on the displayed query no longer being valid
SQL. But now that we've chosen a format that uses SQL comments, this is
no longer a problem, so I think we haven't closed that door yet. But we
may still find out that no user cares about this.

Dmitry?

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos. Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

#5Sami Imseih
samimseih@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Squash constant lists in query jumbling by default

If this GUC sticks around, it should be at least PGC_SUSET (on
the analogy of compute_query_id) to make it harder to break
pg_stat_statements that way.

I have no problem making it superuser-only, and I can see making "on" be
the default. I am not opposed to removing it completely either, if we
really think that the current behavior is no longer useful for anybody.

I am in favor of complete removal. [1]/messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com will change the behavior of table
jumbling without introducing a GUC, and I don't think we should introduce
a GUC for the squash values case either. Why one behavior change is configurable
while the other is not? seems confusing, IMO.

Also, as a matter of principle, it seems most are favoring not
introducing GUCs to
configure queryId behavior. I agree.

[1]: /messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com

--
Sami Imseih
Amazon Web Services (AWS)

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Squash constant lists in query jumbling by default

On Tue, Mar 25, 2025, 6:28 PM Álvaro Herrera <alvherre@alvh.no-ip.org>

wrote:

On 2025-Mar-25, Tom Lane wrote:

If this GUC sticks around, it should be at least PGC_SUSET (on

the analogy of compute_query_id) to make it harder to break

pg_stat_statements that way.

I have no problem making it superuser-only, and I can see making "on" be

the default. I am not opposed to removing it completely either, if we

really think that the current behavior is no longer useful for anybody.

I'm in favor of removing the GUC of course, but if memory serves there
were some folks in the patch discussion thread, who claimed they would
need to be able to keep non-squashed behavior. I don't recall if there were
particular arguments to support that, will try to find those messages again.
But overall as long as nobody objects, I think it's fine to get rid of GUC.

Earlier in the discussion, other possible values for the option were
suggested, such as a way to distinguish arrays that had "lots" (say,
hundreds or more) of entries from arrays that were "small". That could
be selected by the user (or site admin) using this GUC, though there was
no agreement on exactly what that would be. During the FOSDEM 2024
development meeting there was a general dislike of this idea, which
AFAIR was mostly predicated on the displayed query no longer being valid
SQL. But now that we've chosen a format that uses SQL comments, this is
no longer a problem, so I think we haven't closed that door yet. But we
may still find out that no user cares about this.

Agree, the way how things work now brings this option back on the table.
I can refresh the patch doing this, but I'm afk for about a week so it will
take some time. At the same time the proposal to do squashing by default
does not seem to be strictly dependent on that, so maybe they could be
considered as isolated ideas.

Show quoted text
#7Sami Imseih
samimseih@gmail.com
In reply to: Dmitry Dolgov (#6)
Re: Squash constant lists in query jumbling by default

At the same time the proposal to do squashing by default
does not seem to be strictly dependent on that, so maybe they could be
considered as isolated ideas.

Here is a patch to remove the GUC, if we settle on doing so.

--
Sami Imseih
Amazon Web Services (AWS)

Attachments:

v1-0001-Remove-the-query_id_squash_values-GUC.patchapplication/octet-stream; name=v1-0001-Remove-the-query_id_squash_values-GUC.patchDownload+8-98
#8Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#6)
Re: Squash constant lists in query jumbling by default

On Tue, Mar 25, 2025, 7:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Tue, Mar 25, 2025, 6:28 PM Álvaro Herrera <alvherre@alvh.no-ip.org>

wrote:

On 2025-Mar-25, Tom Lane wrote:

If this GUC sticks around, it should be at least PGC_SUSET (on

the analogy of compute_query_id) to make it harder to break

pg_stat_statements that way.

I have no problem making it superuser-only, and I can see making "on" be

the default. I am not opposed to removing it completely either, if we

really think that the current behavior is no longer useful for anybody.

I'm in favor of removing the GUC of course, but if memory serves there
were some folks in the patch discussion thread, who claimed they would
need to be able to keep non-squashed behavior. I don't recall if there were
particular arguments to support that, will try to find those messages
again.

Nevermind, I've checked it out -- I think the case I had in mind [1]/messages/by-id/b8721722-a73e-0ee9-6513-425e9c88d92f@postgrespro.ru in fact
supports GUC removal:

If anyone subtly changes jumbling logic when the extension is

active, the instance could get huge performance issues.

[1]: /messages/by-id/b8721722-a73e-0ee9-6513-425e9c88d92f@postgrespro.ru
/messages/by-id/b8721722-a73e-0ee9-6513-425e9c88d92f@postgrespro.ru

Show quoted text
#9Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Christoph Berg (#2)
Re: Squash constant lists in query jumbling by default

On Tue, 2025-03-25 at 17:28 +0100, Christoph Berg wrote:

The "jumble names of temp tables" thread was briefly touching this [1],
I'm starting a new thread since the others are already very long.

[1] /messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com

Two points were made:

1) this should be on by default
2) there should be no GUC for it.

+1 on both

Yours,
Laurenz Albe

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Laurenz Albe (#9)
Re: Squash constant lists in query jumbling by default

On 2025-Mar-25, Laurenz Albe wrote:

On Tue, 2025-03-25 at 17:28 +0100, Christoph Berg wrote:

The "jumble names of temp tables" thread was briefly touching this [1],
I'm starting a new thread since the others are already very long.

[1] /messages/by-id/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s+huLLHMKR_+MCk8RPQ@mail.gmail.com

Two points were made:

1) this should be on by default
2) there should be no GUC for it.

+1 on both

Well, the votes are quite clear, so I have pushed Sami's patch.

Thank, everybody!

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense." (Paul Thomas)