Normalize queries starting with SET for pg_stat_statements

Started by Greg Sabino Mullaneover 1 year ago7 messageshackers

greg@turnstep.com

over 1 year ago

I saw a database recently where some app was inserting the source port into
the application_name field, which meant that pg_stat_statements.max was
quickly reached and queries were simply pouring in and out of
pg_stat_statements, dominated by some "SET application_name = 'myapp
10.0.0.1:1234'" calls. Which got me thinking, is there really any value to
having non-normalized 'SET application_name' queries inside of
pg_stat_statements? Or any SET stuff, for that matter?

Attached please find a small proof-of-concept for normalizing/de-jumbling
certain SET queries. Because we only want to cover the VAR_SET_VALUE parts
of VariableSetStmt, a custom jumble func was needed. There are a lot of
funky SET things inside of gram.y as well that don't do the standard SET X
= Y formula (e.g. SET TIME ZONE, SET SCHEMA). I tried to handle those as
best I could, and carved a couple of exceptions for time zones and xml.

I'm not sure where else to possibly draw lines. Obviously calls to time
zone have a small and finite pool of possible values, so easy enough to
exclude them, while things like application_name and work_mem are fairly
infinite, so great candidates for normalizing. One could argue for simply
normalizing everything, as SET is trivially fast for purposes of
performance tracking via pg_stat_statements, so who cares if we don't have
the exact string? That's what regular logging is for, after all. Most
importantly, less unique queryids means less chance that errant SETs will
crowd out the more important stuff.

In summary, we want to change this:

SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
1 | set application_name = 'alice'
1 | set application_name = 'bob'
1 | set application_name = 'eve'
1 | set application_name = 'mallory'

to this:

SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
4 | set application_name = $1

I haven't updated the regression tests yet, until we reach a consensus on
how thorough the normalizing should be. But there is a new test to exercise
the changes in gram.y.

Cheers,
Greg

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Greg Sabino Mullane (#1)

Re: Normalize queries starting with SET for pg_stat_statements

On Mon, Jul 22, 2024 at 03:23:50PM -0400, Greg Sabino Mullane wrote:

I saw a database recently where some app was inserting the source port into
the application_name field, which meant that pg_stat_statements.max was
quickly reached and queries were simply pouring in and out of
pg_stat_statements, dominated by some "SET application_name = 'myapp
10.0.0.1:1234'" calls. Which got me thinking, is there really any value to
having non-normalized 'SET application_name' queries inside of
pg_stat_statements? Or any SET stuff, for that matter?

Thanks for beginning this discussion. This has been mentioned in the
past, like here, but I never came back to it:
/messages/by-id/B44FA29D-EBD0-4DD9-ABC2-16F1CB087074@amazon.com

I've seen complaints about that in the field myself, and like any
other specific workloads, the bloat this stuff creates can be really
annoying when utilities are tracked. So yes, I really want more stuff
to happen here.

Attached please find a small proof-of-concept for normalizing/de-jumbling
certain SET queries. Because we only want to cover the VAR_SET_VALUE parts
of VariableSetStmt, a custom jumble func was needed. There are a lot of
funky SET things inside of gram.y as well that don't do the standard SET X
= Y formula (e.g. SET TIME ZONE, SET SCHEMA). I tried to handle those as
best I could, and carved a couple of exceptions for time zones and xml.

Agreed about the use of a custom jumble function. A huge issue that I
have with this parsing node is how much we want to control back in the
monitoring reports while not making the changes too invasive in the
structure of VariableSetStmt. The balance between invasiveness and
level of normalization was the tricky part for me.

I'm not sure where else to possibly draw lines. Obviously calls to time
zone have a small and finite pool of possible values, so easy enough to
exclude them, while things like application_name and work_mem are fairly
infinite, so great candidates for normalizing. One could argue for simply
normalizing everything, as SET is trivially fast for purposes of
performance tracking via pg_stat_statements, so who cares if we don't have
the exact string? That's what regular logging is for, after all. Most
importantly, less unique queryids means less chance that errant SETs will
crowd out the more important stuff.

In summary, we want to change this:

SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
1 | set application_name = 'alice'
1 | set application_name = 'bob'
1 | set application_name = 'eve'
1 | set application_name = 'mallory'

to this:

SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
4 | set application_name = $1

Yep. That makes sense to me. We should keep the parameter name, hide
the value. CallStmt does that.

I haven't updated the regression tests yet, until we reach a consensus on
how thorough the normalizing should be. But there is a new test to exercise
the changes in gram.y.

It would be nice to maximize the contents of the tests at SQL level.
The number of patterns to track makes the debuggability much harder to
track correctly in TAP as we may rely on more complex quals in
pg_stat_statements or even other catalogs.

+   if (expr->kind != VAR_SET_VALUE ||
+       strcmp(expr->name, "timezone") == 0
+       || strcmp(expr->name, "xmloption") == 0)
+       JUMBLE_NODE(args);

We should do this kind of tracking with a new flag in the structure
itself, not in the custom function. The point would be to have
folks introducing new hardcoded names in VariableSetStmt in the parser
think about what they want to do, not second-guess it by tweaking the
custom jumbling function. Relying on the location would not be enough
as we need to cover document_or_content for xmloption or the default
keyword for TIME ZONE. Let's just use the same trick as
DeallocateStmt.isall, with a new field to differentiate all these
cases :)

There are some funky cases you are not mentioning, though, like SET in
a CREATE FUNCTION:
CREATE OR REPLACE FUNCTION foo_function(data1 text) RETURNS text AS $$
DECLARE
res text;
BEGIN
SELECT data1::text INTO res;
RETURN res;
END;
$$ LANGUAGE plpgsql IMMUTABLE SET search_path = 'pg_class,public';

Your patch silences the SET value, but perhaps we should not do that
for this case. I am not against normalizing that, actually, I am in
favor of it, because it makes the implementation much easier and
the FunctionParameter List of parameters is jumbled with
CreateFunctionStmt. All that requires test coverage.

It's nice to see that you are able to keep SET TRANSACTION at their
current level with the location trick.

You have issues with RESET SESSION AUTHORIZATION. This one is easy:
update the location field to -1 in reset_rest for all the subcommands.

The coverage of the regression tests in pg_stat_statements is mostly
right. I may be missing something, but all the SQL queries you have
in your 020_jumbles.pl would work fine with SQL tests, and some like
`SET param = value` are already covered.
--
Michael

Greg Sabino Mullane

greg@turnstep.com

over 1 year ago

In reply to: Michael Paquier (#2)

Re: Normalize queries starting with SET for pg_stat_statements

Now that I've spent some time away from this, I'm reconsidering why we are
going through all the trouble of semi-jumbling SET statements. Maybe we
just keep it simple and everything becomes "SET myvar = $1" or even "SET
myvar" full stop? I'm having a hard time finding a real-world situation in
which we need to distinguish different SET/RESET items within
pg_stat_statements.

Cheers,
Greg

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Greg Sabino Mullane (#3)

Re: Normalize queries starting with SET for pg_stat_statements

On Tue, Aug 13, 2024 at 10:54:34AM -0400, Greg Sabino Mullane wrote:

Now that I've spent some time away from this, I'm reconsidering why we are
going through all the trouble of semi-jumbling SET statements. Maybe we
just keep it simple and everything becomes "SET myvar = $1" or even "SET
myvar" full stop?

Showing a dollar-character to show the fact that we have a value
behind makes the post sense to me.

I'm having a hard time finding a real-world situation in
which we need to distinguish different SET/RESET items within
pg_stat_statements.

I'm -1 on keeping the distinction, and AFAIK it's not really different
with the underlying problems that we need to solve for SET TRANSACTION
and the kind, no?

FWIW, I'm OK with hiding the value when it comes to a SET clause in a
CREATE FUNCTION. We already hide the contents of SQL queries inside
the SQL functions when these are queries that can be normalized, so
there is a kind of thin argument for consistency, or something close
to that.
--
Michael

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Michael Paquier (#4)

Re: Normalize queries starting with SET for pg_stat_statements

On Mon, Aug 19, 2024 at 03:28:52PM +0900, Michael Paquier wrote:

FWIW, I'm OK with hiding the value when it comes to a SET clause in a
CREATE FUNCTION. We already hide the contents of SQL queries inside
the SQL functions when these are queries that can be normalized, so
there is a kind of thin argument for consistency, or something close
to that.

This thread was one of the things I wanted to look at for this commit
fest because that's an issue I have on my stack for some time. And
here you go with a revised patch set.

First, the TAP test proposed upthread is not required, so let's remove
it and rely on pg_stat_statements. It is true that there are a lot of
coverage holes in pg_stat_statements with the various flavors of SET
queries. The TAP test was able to cover a good part of them, still
missed a few spots.

A second thing is that like you I have settled down to a custom
implementation for VariableSetStmt because we have too many grammar
patterns, some of them with values nested in clauses (SET TIME ZONE is
one example), and we should report the grammar keywords without
hardcoding a location. Like for the TIME ZONE part, this comes to a
limitation because it is not possible to normalize the case of SET
TIME ZONE 'value' without some refactoring of gram.y. Perhaps we
could do that in the long-term, but I come down to the fact that I'm
also OK with letting things as they are in the patch, because the
primary case I want to tackle at the SET name = value patterns, and
SET TIME ZONE is just a different flavor of that that can be
translated as well to the most common "name = value" pattern. A
second case is SET SESSION CHARACTERISTICS AS TRANSACTION with its
list of options. Contrary to the patch proposed, I don't think that
it is a good idea to hide that arguments may be included in the
jumbling in the custom function, so I have added one field in
VariableSetStmt to do that, and documented why we need the field in
parsenodes.h. That strikes me as the best balance, and that's going
to be hard to miss each time somebody adds a new grammar for
VariableSetStmt. That's very unlikely at this stage of the project,
but who knows, people like fancy new features.

Attached are two patches:
- 0001 adds a bunch of tests in pg_stat_statements, to cover the SET
patterns. (typo in commit message of this patch, will fix later)
- 0002 is the addition of the normalization. It is possible to see
how the normalization changes things in pg_stat_statements.

0001 is straight-forward and that was I think a mistake to not include
that from the start when I've expanded these tests in the v16 cycle
(well, time..). 0002 also is quite conservative at the end, and this
design can be used to tune easily the jumbling patterns from gram.y
depending on the feedback we'd get.
--
Michael

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Michael Paquier (#5)

Re: Normalize queries starting with SET for pg_stat_statements

On Tue, Sep 24, 2024 at 04:57:28PM +0900, Michael Paquier wrote:

0001 is straight-forward and that was I think a mistake to not include
that from the start when I've expanded these tests in the v16 cycle
(well, time..). 0002 also is quite conservative at the end, and this
design can be used to tune easily the jumbling patterns from gram.y
depending on the feedback we'd get.

Applied 0001 for now to expand the tests, with one tweak: the removal
of SET NAMES. It was proving tricky to use something else than UTF-8,
the CI complaining on Windows. True that this could use like unaccent
and an alternate output in a different file, but I'm not inclined to
take the cost just for this specific query pattern.

The remaining 0002 is attached for now. I am planning to wrap that
next week after a second lookup, except if there are any comments, of
course.
--
Michael

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Michael Paquier (#6)

Re: Normalize queries starting with SET for pg_stat_statements

On Wed, Sep 25, 2024 at 12:10:02PM +0900, Michael Paquier wrote:

The remaining 0002 is attached for now. I am planning to wrap that
next week after a second lookup, except if there are any comments, of
course.

And done with that, after a second round, tweaking some comments.

Thanks Greg for sending the patch and pushing this feature forward.
This finishes what I had on my TODO bucket in terms of normalization
when this stuff has begun in v16 with 3db72ebcbe20.
--
Michael

Normalize queries starting with SET for pg_stat_statements

Attachments:

Attachments:

Attachments: