Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

Started by Peter Geogheganabout 10 years ago14 messages
#1Peter Geoghegan
pg@heroku.com

On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote:

pg_stat_statements' fingerprinting logic considers the following two
statements as distinct:

select 1 in (1, 2, 3);
select 1 in (1, 2, 3, 4);

This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
of elements recursively. In this case it's a list of Const nodes, and
the fingerprinting logic jumbles those nodes indifferently.

Somebody told me that they think that pg_stat_statements should not do
that. This person felt that it would be preferable for such
expressions to be normalized without regard to the number of distinct
Const elements. I suppose that that would work by determing if the
ArrayExpr elements list was a list of Const nodes and only const
nodes. Iff that turned out to be the case, something else would be
jumbled (something other than the list) that would essentially be a
representation of "some list of zero or more (or maybe one or more)
Const nodes with consttype of, in this example, 23". I think that this
would make at least one person happy, because of course the two
statements above would have their costs aggregated within a single
pg_stat_statements entry.

Baron Schwartz recently gave a talk at PGConf Silicon Valley about the
proprietary query instrumentation tool, VividCortex. The slides are
available from:

http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

His theory is that you should allow a proprietary binary to run with
root permissions, a binary that sniffs the wire protocol, mostly
because pg_stat_statements has this limitation (all other
pg_stat_statements limitations listed are pretty unconvincing, IMV).
That doesn't seem like a great plan to me, but I think he has a point
about pg_stat_statements. It's about time that we fixed this -- it
isn't realistic to imagine that people are going to know to use an
array constant like "= any ('{1,2,3}')" -- even a major contributor to
Django that I talked to about this issue a couple of years ago didn't
know about that. It also isn't portable across database systems.

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

* How might altering the jumbling logic to make it recognize a
variable number of constants as equivalent work in practice? Perhaps
we should do something to flatten the representation based on which
powers of two the number of constants is between. There are still some
details to work out there, but that's my first idea. That seems like a
good compromise between the current behavior, and completely
disregarding the number of constants.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Greg Stark
stark@mit.edu
In reply to: Peter Geoghegan (#1)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

I also always felt there should be some kind of ??? symbol to
represent a list of constants. In my experience these lists are
actually *more* likely to be variables being inlined than single
constants since it's easy to use :1 etc for single constants and quite
a bit trickier to do it for lists. I guess that might be changed these
days since I think you can do =any(?::int[]) and construct an array
literal as a parameter. But plenty of code actually constructs lists
of question marks to interpolate.

I have also seen code where I would have needed *not* to have this
jumbling though. I think this is a general problem with jumbling that
there needs to be some kind of intelligence that deduces when it's
important to break out the statistics by constant. In my case it was
an IN query where specific values were very common but others very
rare. Partial indexes ended up being the solution and we had to
identify which partial indexes were needed.

Incidentally there's another feature pg_stat_statements *really*
needs. A way to convert a jumbled statement into one that can be
prepared easily. The use of ? instead of :1 :2 etc makes this a
mechanical but annoying process. Adding ??? would make it even more
annoying. Even just a function that does this (and takes an optional
list of counts for lists I guess?) would be a big help.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Ants Aasma
ants.aasma@eesti.ee
In reply to: Greg Stark (#2)
Re: Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Tue, Nov 24, 2015 at 2:25 PM, Greg Stark <stark@mit.edu> wrote:

On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

+1 for changing jumbling logic to consider any number of constants as identical.

I have also seen code where I would have needed *not* to have this
jumbling though. I think this is a general problem with jumbling that
there needs to be some kind of intelligence that deduces when it's
important to break out the statistics by constant. In my case it was
an IN query where specific values were very common but others very
rare. Partial indexes ended up being the solution and we had to
identify which partial indexes were needed.

This is an issue with jumbling in general and very vaguely related to
merging the number of constants in IN. I think a better solution for
this type of issue would be a way to sample the parameter values and
possibly EXPLAIN ANALYZE results to logs. Having runtime intelligence
in PostgreSQL that would be capable of distinguishing interesting
variations from irrelevant doesn't seem like a feasible plan. In my
view the best we could do is to aim to have entries roughly correspond
to application query invocation points and leave the more complex
statistical analysis use cases to more specialized tools.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#1)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote:

pg_stat_statements' fingerprinting logic considers the following two
statements as distinct:

select 1 in (1, 2, 3);
select 1 in (1, 2, 3, 4);

This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
of elements recursively. In this case it's a list of Const nodes, and
the fingerprinting logic jumbles those nodes indifferently.

I think this is a vastly oversimplified explanation of the problem.
In particular, because the planner will flatten an ArrayExpr containing
only Const nodes to an array constant (see eval_const_expressions),
I don't believe the case ever arises in exactly the form you posit here.

A portion of the problem is possibly due to the heuristics in
parse_expr.c's transformAExprIn():

* We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
* possible if there is a suitable array type available. If not, we fall
* back to a boolean condition tree with multiple copies of the lefthand
* expression. Also, any IN-list items that contain Vars are handled as
* separate boolean conditions, because that gives the planner more scope
* for optimization on such clauses.

If the original text actually involves a variable number of Vars, then you
will end up with a boolean expression with a varying number of OR arms,
even if the Vars later get flattened to constants. However, it's not
clear to me that anyone would expect such cases to be treated as
identical. Another possibility is a type clash, for example
"x IN (42, 44.1)" will end up as a boolean tree for lack of a common
type for the would-be array elements. That case might possibly be an
issue in practice.

But what seems more likely to be annoying people is cases in which the
original text contains a varying number of Param markers. Those might or
might not get folded to constants during planning depending on context,
so that they might or might not look different to pg_stat_statements.

So I suspect the real problem here is that we might want all of these
things to look identical to pg_stat_statements:

ARRAY[$1, $2, 42]
ARRAY[$1, $2, $3, 47]
'{1,2,3,47}'::int[]

Don't see a very clean way to do that ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#4)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On 11/24/15 1:29 PM, Tom Lane wrote:

So I suspect the real problem here is that we might want all of these
things to look identical to pg_stat_statements:

ARRAY[$1, $2, 42]
ARRAY[$1, $2, $3, 47]
'{1,2,3,47}'::int[]

Don't see a very clean way to do that ...

Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other
words, treating an integer as text. A lot of frameworks like to do that
and just push the problem onto the database. I'm not sure what
pg_stat_statements would ultimately see in that case..

Since there's a few different things people might want, maybe a good
first step is to allow extending/changing the jumbling decision at the C
level. That would make it easy for a knowledgeable enough person to come
up with an alternative as a plugin that regular users could use.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Peter Geoghegan
pg@heroku.com
In reply to: Jim Nasby (#5)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
treating an integer as text. A lot of frameworks like to do that and just
push the problem onto the database. I'm not sure what pg_stat_statements
would ultimately see in that case..

They do?

postgres=# select 5::int4 in ('5');
?column?
──────────
t
(1 row)

postgres=# select 5::int4 in ('5a');
ERROR: 22P02: invalid input syntax for integer: "5a"
LINE 1: select 5::int4 in ('5a');
^
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

I wrote:

Peter Geoghegan <pg@heroku.com> writes:

This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
of elements recursively. In this case it's a list of Const nodes, and
the fingerprinting logic jumbles those nodes indifferently.

I think this is a vastly oversimplified explanation of the problem.
In particular, because the planner will flatten an ArrayExpr containing
only Const nodes to an array constant (see eval_const_expressions),
I don't believe the case ever arises in exactly the form you posit here.

Um ... disregard that. For some reason I was thinking that
pg_stat_statements looks at plan trees, but of course what it looks at
is the query tree immediately post-parse-analysis. So the behavior of
the const-folding pass is not relevant.

The heuristics in transformAExprIn() are still relevant, though, and
I suspect that the question of whether Params should be considered
the same as Consts is also highly relevant.

I wonder whether we could improve this by arranging things so that both
Consts and Params contribute zero to the jumble hash, and a list of these
things also contributes zero, regardless of the length of the list.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Peter Geoghegan (#6)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On 11/24/15 7:46 PM, Peter Geoghegan wrote:

On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
treating an integer as text. A lot of frameworks like to do that and just
push the problem onto the database. I'm not sure what pg_stat_statements
would ultimately see in that case..

They do?

postgres=# select 5::int4 in ('5');
?column?
──────────
t
(1 row)

postgres=# select 5::int4 in ('5a');
ERROR: 22P02: invalid input syntax for integer: "5a"
LINE 1: select 5::int4 in ('5a');
^

I'm not following your point. Obviously you can't compare int to text
that doesn't convert back to an int, but that's not what I was talking
about.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Peter Geoghegan
pg@heroku.com
In reply to: Jim Nasby (#8)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Tue, Nov 24, 2015 at 10:55 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

I'm not following your point. Obviously you can't compare int to text that
doesn't convert back to an int, but that's not what I was talking about.

I didn't see what else you could have meant. In any case, the type
text has no involvement in your example.

pg_stat_statements sees the following SQL queries as equivalent:

select integer '5';

select 6;

select 7::int4;

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Lukas Fittl
lukas@fittl.com
In reply to: Peter Geoghegan (#1)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

As someone who runs a little monitoring service thats solely based on
pg_stat_statements data, ignoring IN list length would certainly be a good
change.

We currently do this in post-processing, together with other data cleanup
(e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan was
chosen, I don't think distinguishing based on the length of the list helps
anyone really.

I see pg_stat_statements as a high-level overview of which queries have
run, and which ones you might want to look into closer using e.g.
auto_explain.

Best,
Lukas

--
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630

#11Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Lukas Fittl (#10)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote:

On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

As someone who runs a little monitoring service thats solely based on
pg_stat_statements data, ignoring IN list length would certainly be a good
change.

We currently do this in post-processing, together with other data cleanup
(e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan
was chosen, I don't think distinguishing based on the length of the list
helps anyone really.

I see pg_stat_statements as a high-level overview of which queries have
run, and which ones you might want to look into closer using e.g.
auto_explain.

I still have the plans to try to marry pg_stat_statements and auto_explain
for the next iteration of "online query plans" extension I was proposing a
few months ago, and the first thing I was going to look into is rectifying
this problem with IN() jumbling. So, have a +1 from me.

--
Alex

#12Bruce Momjian
bruce@momjian.us
In reply to: Shulgin, Oleksandr (#11)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Mon, Nov 30, 2015 at 02:41:02PM +0100, Shulgin, Oleksandr wrote:

On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote:

On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:

One specific justification he gave for not using pg_stat_statements
was:

"Doesn’t merge bind vars in IN()" (See slide #11)

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

As someone who runs a little monitoring service thats solely based on
pg_stat_statements data, ignoring IN list length would certainly be a good
change.

We currently do this in post-processing, together with other data cleanup
(e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan was
chosen, I don't think distinguishing based on the length of the list helps
anyone really.

I see pg_stat_statements as a high-level overview of which queries have
run, and which ones you might want to look into closer using e.g.
auto_explain.

I still have the plans to try to marry pg_stat_statements and auto_explain for
the next iteration of "online query plans" extension I was proposing a few
months ago, and the first thing I was going to look into is rectifying this
problem with IN() jumbling.  So, have a +1 from me.

Is this a TODO?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Peter Geoghegan
pg@heroku.com
In reply to: Bruce Momjian (#12)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote:

Is this a TODO?

It's on my (very long) personal TODO list. It would be great if
someone else took it, though. So, yes.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#13)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Wed, Dec 30, 2015 at 05:21:05PM -0800, Peter Geoghegan wrote:

On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote:

Is this a TODO?

It's on my (very long) personal TODO list. It would be great if
someone else took it, though. So, yes.

TODO added.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers