Add min and max execute statement time in pg_stat_statement
I am submitting a patch to the next CF that adds min and max execution times to
pg_stat_statements.
pg_stat_statements already reports execution time, but only as an average, which
does not provide much detail. So this patch adds min and max execution time
columns to pg_stat_statements. Usage is almost the same as before; however, I
also add a pg_stat_statements_reset_time() function so that min_time and
max_time can be obtained for a specific period. It resets (re-initializes) the
previously recorded min and max execution times.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments: pg_stat_statements-min_max_exectime_v0.patch (text/x-diff, +128 -54)
On 10/18/2013 04:02 AM, KONDO Mitsumasa wrote:
I am submitting a patch to the next CF that adds min and max execution times to
pg_stat_statements. pg_stat_statements already reports execution time, but only
as an average, which does not provide much detail. So this patch adds min and
max execution time columns to pg_stat_statements. Usage is almost the same as
before; however, I also add a pg_stat_statements_reset_time() function so that
min_time and max_time can be obtained for a specific period. It resets
(re-initializes) the previously recorded min and max execution times.
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
cheers
andrew
(2013/10/18 22:21), Andrew Dunstan wrote:
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
OK, I will! I am working on some other patches as well, so please wait a bit longer.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 22/10/13 00:17, KONDO Mitsumasa wrote:
(2013/10/18 22:21), Andrew Dunstan wrote:
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
OK, I will! I am working on some other patches as well, so please wait a bit longer.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
How about the 'median'? It is often a lot more useful than the 'arithmetic
mean' (which most people call the 'average').
Cheers,
Gavin
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
How about the 'median', often a lot more useful than the 'arithmetic
mean' (which most people call the 'average').
AFAIK, median is impossible to calculate cheaply (in particular, with
a fixed amount of workspace). So this apparently innocent request
is actually moving the goalposts a long way, because the space per
query table entry is a big concern for pg_stat_statements.
regards, tom lane
On Mon, Oct 21, 2013 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
How about the 'median', often a lot more useful than the 'arithmetic
mean' (which most people call the 'average').
AFAIK, median is impossible to calculate cheaply (in particular, with
a fixed amount of workspace). So this apparently innocent request
is actually moving the goalposts a long way, because the space per
query table entry is a big concern for pg_stat_statements.
Yeah, and I worry about min and max not being very usable - once they
get pushed out to extreme values, there's nothing to drag them back
toward normality except resetting the stats, and that's not something
we want to encourage people to do frequently. Of course, averages over
very long sampling intervals may not be too useful anyway, dunno.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 21, 2013 at 1:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Yeah, and I worry about min and max not being very usable - once they
get pushed out to extreme values, there's nothing to drag them back
toward normality except resetting the stats, and that's not something
we want to encourage people to do frequently.
My thoughts exactly. Perhaps it'd be useful to separately invalidate
min/max times, without a full reset. But then you've introduced the
possibility of the average time (total_time/calls) exceeding the max
or being less than the min.
--
Peter Geoghegan
On 22/10/13 09:01, Tom Lane wrote:
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
If we're going to extend pg_stat_statements, even more than min and max
I'd like to see the standard deviation in execution time.
How about the 'median', often a lot more useful than the 'arithmetic
mean' (which most people call the 'average').
AFAIK, median is impossible to calculate cheaply (in particular, with
a fixed amount of workspace). So this apparently innocent request
is actually moving the goalposts a long way, because the space per
query table entry is a big concern for pg_stat_statements.
regards, tom lane
Yeah, obvious - in retrospect! :-)
One way it could be done, but even this would consume far too much
storage and processing power (hence totally impractical), would be to
'simply' store a counter for each value found and increment it for each
occurrence...
Cheers,
Gavin
Gavin Flower wrote:
One way it could be done, but even this would consume far too much
storage and processing power (hence totally impractical), would be
to 'simply' store a counter for each value found and increment it
for each occurrence...
A histogram? That sounds like a lot of code complexity to me. Not
sure the gain is enough.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/21/2013 04:43 PM, Peter Geoghegan wrote:
On Mon, Oct 21, 2013 at 1:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Yeah, and I worry about min and max not being very usable - once they
get pushed out to extreme values, there's nothing to drag them back
toward normality except resetting the stats, and that's not something
we want to encourage people to do frequently.
My thoughts exactly. Perhaps it'd be useful to separately invalidate
min/max times, without a full reset. But then you've introduced the
possibility of the average time (total_time/calls) exceeding the max
or being less than the min.
This is why I suggested the standard deviation, and why I find it would
be more useful than just min and max. A couple of outliers will set the
min and max to possibly extreme values but hardly perturb the standard
deviation over a large number of observations.
cheers
andrew
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, and I worry about min and max not being very usable - once they
get pushed out to extreme values, there's nothing to drag them back
toward normality except resetting the stats, and that's not something
we want to encourage people to do frequently. Of course, averages over
very long sampling intervals may not be too useful anyway, dunno.
Good point, but that doesn't mean that the request is unreasonable.
For min/max, we could possibly address this concern by introducing an
exponential decay over time --- that is, every so often, you take some
small fraction of (max - min) and add that to the running min while
subtracting it from the max. Or some other variant on that theme. There
might be a way to progressively discount old observations for average too,
though I'm not sure exactly how at the moment.
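A minimal sketch of that decay idea (illustrative only, not taken from any
actual pg_stat_statements patch; the Counters fields and the 5% factor are
assumptions):

#include <stdio.h>

/* Illustrative counters; a real per-query entry tracks more fields. */
typedef struct Counters
{
    double  min_time;           /* fastest execution seen, in ms */
    double  max_time;           /* slowest execution seen, in ms */
} Counters;

/*
 * Run "every so often" (per N executions, or on a timer): pull min and max
 * back toward each other by a small fraction of their current spread.
 */
static void
decay_min_max(Counters *c, double decay_factor)
{
    double  spread = c->max_time - c->min_time;

    c->min_time += spread * decay_factor;
    c->max_time -= spread * decay_factor;
}

int
main(void)
{
    Counters    c = {0.5, 900.0};   /* a single 900 ms outlier is stuck in max */
    int         i;

    for (i = 0; i < 10; i++)
        decay_min_max(&c, 0.05);    /* 5% of the spread per tick */

    printf("after decay: min=%.2f ms max=%.2f ms\n", c.min_time, c.max_time);
    return 0;
}

Each tick shrinks the spread by 10%, so an extreme max drifts back toward
typical values instead of sticking until the next stats reset.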
regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes:
This is why I suggested the standard deviation, and why I find it would
be more useful than just min and max. A couple of outliers will set the
min and max to possibly extreme values but hardly perturb the standard
deviation over a large number of observations.
Hm. It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)? Is there a
good reason to suppose that query runtime is Gaussian? (I'd bet not;
in particular, multimodal behavior seems very likely due to things like
plan changes.) If not, how much does that affect the usefulness of
a standard-deviation calculation?
regards, tom lane
On 10/21/2013 07:29 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
This is why I suggested the standard deviation, and why I find it would
be more useful than just min and max. A couple of outliers will set the
min and max to possibly extreme values but hardly perturb the standard
deviation over a large number of observations.
Hm. It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)? Is there a
good reason to suppose that query runtime is Gaussian? (I'd bet not;
in particular, multimodal behavior seems very likely due to things like
plan changes.) If not, how much does that affect the usefulness of
a standard-deviation calculation?
IANA statistician, but the article at
<https://en.wikipedia.org/wiki/Standard_deviation> appears to have a
diagram with one sample that's multi-modal.
cheers
andrew
On Mon, Oct 21, 2013 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm. It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)?
I don't see how. The standard deviation here would be expressed in
units of milliseconds. Now, that could be misleading, in that like a
mean average, it might "mischaracterize" the distribution. But it's
still got to be a big improvement.
I like the idea of a decay, but can't think of a principled scheme offhand.
--
Peter Geoghegan
On Tue, Oct 22, 2013 at 1:09 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Gavin Flower wrote:
One way it could be done, but even this would consume far too much
storage and processing power (hence totally impractical), would be
to 'simply' store a counter for each value found and increment it
for each occurrence...
A histogram? That sounds like a lot of code complexity to me. Not
sure the gain is enough.
I have a proof of concept patch somewhere that does exactly this. I
used logarithmic bin widths. With 8 log10 bins you can tell the
fraction of queries running at each order of magnitude from less than
1ms to more than 1000s. Or with 31 bins you can cover factor of 2
increments from 100us to over 27h. And the code is almost trivial,
just take a log of the duration and calculate the bin number from that
and increment the value in the corresponding bin.
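For illustration, a minimal sketch of that binning (not the actual
proof-of-concept patch; the layout of one underflow bin below 100us plus
factor-of-2 bins above it is an assumption):

#include <math.h>
#include <stdio.h>

#define NUM_BINS      31
#define MIN_DURATION  0.0001    /* 100us, in seconds */

static int histogram[NUM_BINS];

/* Take the log of the duration, derive the bin number, bump that counter. */
static void
record_duration(double seconds)
{
    int     bin = 0;            /* bin 0: anything at or below 100us */

    if (seconds > MIN_DURATION)
        bin = (int) floor(log2(seconds / MIN_DURATION)) + 1;
    if (bin >= NUM_BINS)
        bin = NUM_BINS - 1;     /* top bin collects the extreme tail */
    histogram[bin]++;
}

int
main(void)
{
    int     i;

    record_duration(0.00002);   /* 20us  -> bin 0 */
    record_duration(0.0015);    /* 1.5ms -> bin 4 */
    record_duration(120.0);     /* 2min  -> bin 21 */

    for (i = 0; i < NUM_BINS; i++)
        if (histogram[i] > 0)
            printf("bin %d: %d\n", i, histogram[i]);
    return 0;
}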
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 22/10/13 13:26, Ants Aasma wrote:
On Tue, Oct 22, 2013 at 1:09 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Gavin Flower wrote:
One way it could be done, but even this would consume far too much
storage and processing power (hence totally impractical), would be
to 'simply' store a counter for each value found and increment it
for each occurrence...
A histogram? That sounds like a lot of code complexity to me. Not
sure the gain is enough.
I have a proof of concept patch somewhere that does exactly this. I
used logarithmic bin widths. With 8 log10 bins you can tell the
fraction of queries running at each order of magnitude from less than
1ms to more than 1000s. Or with 31 bins you can cover factor of 2
increments from 100us to over 27h. And the code is almost trivial,
just take a log of the duration and calculate the bin number from that
and increment the value in the corresponding bin.
Regards,
Ants Aasma
That might be useful in determining if things are sufficiently bad to be
worth investigating in more detail. No point in tuning stuff that is
behaving acceptably.
It would also be good enough to say that 95% of executions finish within 5 seconds (or whatever).
Cheers,
Gavin
On 22/10/13 13:26, Ants Aasma wrote:
On Tue, Oct 22, 2013 at 1:09 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Gavin Flower wrote:
One way it could be done, but even this would consume far too much
storage and processing power (hence totally impractical), would be
to 'simply' store a counter for each value found and increment it
for each occurrence...
A histogram? That sounds like a lot of code complexity to me. Not
sure the gain is enough.
I have a proof of concept patch somewhere that does exactly this. I
used logarithmic bin widths. With 8 log10 bins you can tell the
fraction of queries running at each order of magnitude from less than
1ms to more than 1000s. Or with 31 bins you can cover factor of 2
increments from 100us to over 27h. And the code is almost trivial,
just take a log of the duration and calculate the bin number from that
and increment the value in the corresponding bin.
Regards,
Ants Aasma
I suppose this has to be decided at compile time to keep the code both
simple and efficient - if so, I like the binary approach.
Curious, why start at 100us? I suppose this might be of interest if
everything of note is in RAM and/or stuff is on SSDs.
Cheers,
Gavin
On Tue, Oct 22, 2013 at 4:00 AM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
I have a proof of concept patch somewhere that does exactly this. I
used logarithmic bin widths. With 8 log10 bins you can tell the
fraction of queries running at each order of magnitude from less than
1ms to more than 1000s. Or with 31 bins you can cover factor of 2
increments from 100us to over 27h. And the code is almost trivial,
just take a log of the duration and calculate the bin number from that
and increment the value in the corresponding bin.
I suppose this has to be decided at compile time to keep the code both
simple and efficient - if so, I like the binary approach.
For efficiency's sake it can easily be done at run time; one extra
logarithm calculation per query will not be noticeable. Having a
proper user interface to make it configurable and changeable is where
the complexity is. We might just decide to go with something good
enough as even the 31 bin solution would bloat the pg_stat_statements
data structure only by about 10%.
Curious, why start at 100us? I suppose this might be of interest if
everything of note is in RAM and/or stuff is on SSD's.
Selecting a single row takes about 20us on my computer, so I picked 100us
as a reasonable limit below which the exact speed doesn't matter
anymore.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Tom Lane <tgl@sss.pgh.pa.us> writes:
Hm. It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)? Is there a
I just had a quick chat with a statistician friend of mine on that
topic, and it seems that the only way to make sense of an average is if
you already know the distribution.
In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under an acceptable threshold and 1% of them
taking more and more time.
In a normal (Gaussian) distribution, you would not expect any query
times to fall far from the average, so my experience tells me that the
query time distribution is anything BUT normal (Gaussian).
good reason to suppose that query runtime is Gaussian? (I'd bet not;
in particular, multimodal behavior seems very likely due to things like
plan changes.) If not, how much does that affect the usefulness of
a standard-deviation calculation?
I don't know what multi-modal is.
What I've been gathering from my quick chat this morning is that either
you know how to characterize the distribution, and then the min, max and
average are useful on their own, or you need to keep track of a
histogram where all the bins are of the same size to be able to learn
what the distribution actually is.
We didn't get to the point where I could understand whether storing a
histogram with constant-size bins over the log10 of the data, rather than
the data itself, is going to allow us to properly characterize the
distribution.
The main question I want to answer here would be the percentiles one: I
want to get the maximum query execution time for 95% of the executions,
then 99%, then 99.9%, etc. There's no way to answer that without knowing
the distribution shape, so we need enough stats to learn what the
distribution shape is (hence, histograms).
Of course keeping enough stats seems to always begin with keeping the
min, max and average, so we can just begin there. We would just be
unable to answer interesting questions with just that.
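For illustration, a rough sketch of answering the percentile question from a
log2-bucketed histogram like the one sketched earlier in this thread (the bin
layout and names are assumptions, not from any patch); the answer is an upper
bound whose precision is limited by the bin width:

#include <stdio.h>

#define NUM_BINS      31
#define MIN_DURATION  0.0001    /* 100us, in seconds; bin 0 is "under 100us" */

/*
 * Walk the bins until the cumulative count reaches the requested fraction,
 * then report that bin's upper edge.
 */
static double
histogram_percentile(const long bins[NUM_BINS], double fraction)
{
    long    total = 0;
    long    running = 0;
    double  upper = MIN_DURATION;
    int     i;

    for (i = 0; i < NUM_BINS; i++)
        total += bins[i];

    for (i = 0; i < NUM_BINS; i++)
    {
        running += bins[i];
        if (i > 0)
            upper = MIN_DURATION * (double) (1L << i);  /* bin i upper edge */
        if (total > 0 && (double) running / (double) total >= fraction)
            break;
    }
    return upper;
}

int
main(void)
{
    long    bins[NUM_BINS] = {0};

    bins[3] = 950;              /* most executions finish under ~0.8ms */
    bins[10] = 50;              /* a slow tail in the 51-102ms range */

    printf("p95 <= %.4fs, p99 <= %.4fs\n",
           histogram_percentile(bins, 0.95),
           histogram_percentile(bins, 0.99));
    return 0;
}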
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, Oct 22, 2013 at 2:56 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Hm. It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)? Is there a
I just had a quick chat with a statistician friend of mine on that
topic, and it seems that the only way to make sense of an average is if
you already know the distribution.
In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under an acceptable threshold and 1% of them
taking more and more time.
Agreed.
In a lot of Heroku's performance work, the Perc99 and Perc95 have
provided a lot more value than stddev, although stddev is a lot better
than nothing and probably easier to implement.
There are apparently high-quality statistical approximations of these
that are not expensive to compute and have a small in-memory representation.
That said, I'd take stddev over nothing for sure.
Handily for stddev, with snapshots of count(x), sum(x), and
sum(x**2) (which I understand to be the components of stddev), I think
one can compute stddevs across different time spans using auxiliary
tools that sample this triplet on occasion. That's kind of a handy
property that I'm not sure the percN approximations can offer as easily.
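A sketch of that snapshot property, assuming an external tool records the
triplet periodically (the variance formula E[x^2] - (E[x])^2 is exact but can
lose precision for large values; a real implementation might prefer Welford's
method at the cost of this simple differencing):

#include <math.h>
#include <stdio.h>

/* One sample of the running totals an external tool might record. */
typedef struct StatSnapshot
{
    double  count;              /* number of executions */
    double  sum;                /* sum of execution times, in ms */
    double  sum_sq;             /* sum of squared execution times */
} StatSnapshot;

/* Population stddev over the interval between two snapshots. */
static double
interval_stddev(StatSnapshot older, StatSnapshot newer)
{
    double  n = newer.count - older.count;
    double  s = newer.sum - older.sum;
    double  ss = newer.sum_sq - older.sum_sq;
    double  variance;

    if (n <= 0)
        return 0.0;
    variance = ss / n - (s / n) * (s / n);
    return (variance > 0.0) ? sqrt(variance) : 0.0;
}

int
main(void)
{
    StatSnapshot at_9am = {1000, 5000.0, 27000.0};
    StatSnapshot at_10am = {1600, 9200.0, 63000.0};

    printf("stddev between snapshots: %.3f ms\n",
           interval_stddev(at_9am, at_10am));
    return 0;
}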