More stable query plans via more predictable column statistics

Started by Shulgin, Oleksandr · over 10 years ago · 62 messages · pgsql-hackers
#1 Shulgin, Oleksandr
oleksandr.shulgin@zalando.de

Hi Hackers!

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Introduction
============

We have observed that for certain data sets the distribution of samples
between most_common_vals and histogram_bounds can be unstable: so that it
may change dramatically with the next ANALYZE run, thus leading to
radically different plans.

I was revisiting the following performance thread and I've found some
interesting details about statistics in our environment:

/messages/by-id/CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com

My initial interest was in evaluating whether the distribution of samples
could be made more predictable and less dependent on the factor of luck,
thus leading to more stable execution plans.

Unexpected findings
===================

What I have found is that in a significant percentage of instances, when a
duplicate sample value is *not* put into the MCV list, it does produce
duplicates in the histogram_bounds, so it looks like the MCV cut-off
happens too early, even though we have enough space for more values in the
MCV list.

In the extreme cases I've found completely empty MCV lists and histograms
full of duplicates at the same time, with only about 20% of distinct values
in the histogram (as it turns out, this happens due to high fraction of
NULLs in the sample).

Data set and approach
=====================

In order to obtain these insights into distribution of statistics samples
on one of our bigger databases (~5 TB, 2,300+ tables, 31,000+ individual
attributes) I've built some queries which all start with the following CTEs:

WITH stats1 AS (
    SELECT *,
           current_setting('default_statistics_target')::int stats_target,

           array_length(most_common_vals, 1) AS num_mcv,
           (SELECT SUM(f) FROM UNNEST(most_common_freqs) AS f) AS mcv_frac,

           array_length(histogram_bounds, 1) AS num_hist,
           (SELECT COUNT(DISTINCT h)
              FROM UNNEST(histogram_bounds::text::text[]) AS h) AS distinct_hist

      FROM pg_stats
     WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
),
stats2 AS (
    SELECT *,
           distinct_hist::float / num_hist AS hist_ratio
      FROM stats1
)

The idea here is to collect the number of distinct values in the histogram
bounds vs. the total number of bounds. One of the reasons why there might
be duplicates in the histogram is a fully occupied MCV list, so we collect
the number of MCVs as well, in order to compare it with the stats_target.

These queries assume that all columns use the default statistics target,
which was actually the case with the database where I was testing this.
(It is straightforward to include per-column stats targets in the picture,
but to do that efficiently one has to copy over the definition of the
pg_stats view in order to access pg_attribute.attstattarget, and also run
the queries as superuser to access pg_statistic.  I wanted to keep the
queries simple so that other interested people can easily run them in
their environment, which quite likely excludes superuser access.  The more
general query is also included[3].)

With the CTEs shown above it was easy to assess the following
"meta-statistics":

WITH ...
SELECT count(1),
min(hist_ratio)::real,
avg(hist_ratio)::real,
max(hist_ratio)::real,
stddev(hist_ratio)::real
FROM stats2
WHERE histogram_bounds IS NOT NULL;

-[ RECORD 1 ]----
count | 18980
min | 0.181818
avg | 0.939942
max | 1
stddev | 0.143189

That doesn't look too bad, but the min value is already pretty fishy.  If I
run the same query limited to non-fully-unique histograms, i.e. with
"WHERE distinct_hist < num_hist" instead, the picture is a bit different:

-[ RECORD 1 ]----
count | 3724
min | 0.181818
avg | 0.693903
max | 0.990099
stddev | 0.170845

It shows that about 20% of all analyzed columns that have a histogram
(3724/18980) also have some duplicates in it, and that on average only
about 70% of their sample values are distinct.

Select statistics examples
==========================

Apart from mere aggregates it is interesting to look at some specific
MCV/histogram examples.  The following query aims to reconstruct the
values of certain variables present in the compute_scalar_stats() code in
analyze.c (again, with the same CTEs as above):

WITH ...
SELECT schemaname ||'.'|| tablename ||'.'|| attname || (CASE inherited WHEN
TRUE THEN ' (inherited)' ELSE '' END) AS columnname,
n_distinct, null_frac,
num_mcv, most_common_vals, most_common_freqs,
mcv_frac, (mcv_frac / (1 - null_frac))::real AS nonnull_mcv_frac,
distinct_hist, num_hist, hist_ratio,
histogram_bounds
FROM stats2
ORDER BY hist_ratio
LIMIT 1;

And here is the worst case as shown by this query (real values replaced
with placeholders):

columnname | xxx.yy1.zz1
n_distinct | 22
null_frac | 0.9893
num_mcv |
most_common_vals |
most_common_freqs |
mcv_frac |
nonnull_mcv_frac |
distinct_hist | 4
num_hist | 22
hist_ratio | 0.181818181818182
histogram_bounds |
{aaaa,bbb,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,DDD,zzz}

If one pays attention to the value of null_frac here, it becomes apparent
that it is the reason for this unexpected picture.

The code in compute_scalar_stats() requires a candidate MCV to have a
count above "samplerows / max(ndistinct / 1.25, num_bins)", but samplerows
doesn't account for NULLs; this way we reject would-be MCVs on a wrong
basis.

Here is another, next-to-worst case (the type of this column is actually
TEXT, hence the sort order):

columnname | xxx.yy2.zz2 (inherited)
n_distinct | 30
null_frac | 0
num_mcv | 2
most_common_vals | {101,100}
most_common_freqs | {0.806367,0.1773}
mcv_frac | 0.983667
nonnull_mcv_frac | 0.983667
distinct_hist | 6
num_hist | 28
hist_ratio | 0.214285714285714
histogram_bounds | {202,202,202,202,202,202,202,202,202, *3001*,
302,302,302,302,302,302,302,302,302,302,302, *3031,3185*,
502,502,502,502,502}

(emphasis added around values that are unique)

This time there were no NULLs, but still a lot of duplicates got included
in the histogram.  This happens because there is a relatively small number
of distinct values, so the candidate MCVs are limited by "samplerows /
num_bins".  To really avoid duplicates in the histogram, what we want here
is "samplerows / (num_hist - 1)", but this brings a chicken-and-egg
problem: we don't know the value of num_hist before we determine the
number of MCVs we want to keep.

Solution proposal
=================

The solution I've come up with (and that was also suggested by Jeff Janes
in that performance thread mentioned above, as I have now found out) is to
move the calculation of the target number of histogram bins inside the
loop that evaluates candidate MCVs.  At the same time we should account
for NULLs more accurately and, on every loop iteration, subtract the MCV
counts we include in the list from the count of samples left for the
histogram.  A patch implementing this approach is attached.

When I re-analyzed the worst-case columns with the patch applied, I got
the following picture (some boring stuff in the histogram snipped):

columnname | xxx.yy1.zz1
n_distinct | 21
null_frac | 0.988333
num_mcv | 5
most_common_vals | {DDD,"Rrrrr","Ddddddd","Kkkkkk","Rrrrrrrrrrrrrr"}
most_common_freqs | {0.0108333,0.0001,6.66667e-05,6.66667e-05,6.66667e-05}
mcv_frac | 0.0111333
nonnull_mcv_frac | 0.954287
distinct_hist | 16
num_hist | 16
hist_ratio | 1
histogram_bounds | {aaa,bbb,cccccccc,dddd,"dddd ddddd",Eee,...,zzz}

Now the "DDD" value is treated as an MCV, as it should be.  This arguably
constitutes a bug fix; the rest are probably just improvements.

Here we also see some additional MCVs, which are much less common.  Maybe
we should also cut them off at some low frequency, but it seems hard to
draw the line.  I can see that array_typanalyze.c uses a different
approach with a frequency-based cut-off, for example.

The mentioned next-to-worst case becomes:

columnname | xxx.yy2.zz2 (inherited)
n_distinct | 32
null_frac | 0
num_mcv | 15
most_common_vals |
{101,100,302,202,502,3001,3059,3029,3031,3140,3041,3095,3100,3102,3192}
most_common_freqs |
{0.803933,0.179,0.00656667,0.00526667,0.00356667,0.000333333,0.000133333,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
mcv_frac | 0.999433
nonnull_mcv_frac | 0.999433
distinct_hist | 17
num_hist | 17
hist_ratio | 1
histogram_bounds |
{3007,3011,3027,3056,3067,3073,3084,3087,3088,3106,3107,3118,3134,3163,3204,3225,3247}

Now we don't have any duplicates in the histogram at all, and we still have
a lot of diversity, i.e. the histogram didn't collapse.

The "meta-statistics" also improves if we re-analyze the whole database
with the patch:

(WHERE histogram_bounds IS NOT NULL)
-[ RECORD 1 ]----
count | 18314
min | 0.448276
avg | 0.988884
max | 1
stddev | 0.052899

(WHERE distinct_hist < num_hist)
-[ RECORD 1 ]----
count | 1095
min | 0.448276
avg | 0.81408
max | 0.990099
stddev | 0.119637

We can see that both the worst case and the average have improved.  We did
lose some histograms because they collapsed, but that only constitutes
about 3.5% of the total count.

We can also see that 100% of the instances where we still have duplicates
in the histogram are due to the MCV list being fully occupied:

WITH ...
SELECT count(1), num_mcv = stats_target
FROM stats2
WHERE distinct_hist < num_hist
GROUP BY 2;

-[ RECORD 1 ]--
count | 1095
?column? | t

There's not much left to do but increase the statistics target for these
columns (or globally).

Increasing stats target
=======================

It can be further demonstrated that with the current code there are
certain data distributions where increasing the statistics target doesn't
help; in fact, for a while it only makes things worse.

I've taken one of the tables where we hit the stats_target limit on the
MCV list and tried increasing the target gradually, re-analyzing and
taking notes of the MCV/histogram distribution.  The table has around 1.7M
rows, but only 950 distinct values in one of the columns.

The results of analyzing the table with different statistics targets are
as follows:

stats_target | num_mcv | mcv_frac | num_hist | hist_ratio
--------------+---------+----------+----------+------------
100 | 97 | 0.844133 | 101 | 0.841584
200 | 106 | 0.872267 | 201 | 0.771144
300 | 108 | 0.881244 | 301 | 0.528239
400 | 111 | 0.885992 | 376 | 0.422872
500 | 112 | 0.889973 | 411 | 0.396594
1000 | 130 | 0.914046 | 550 | 0.265455
1500 | 167 | 0.938638 | 616 | 0.191558
1625 | 230 | 0.979393 | 570 | 0.161404
1750 | 256 | 0.994913 | 564 | 0.416667
2000 | 260 | 0.996998 | 601 | 0.63228

One can see that num_mcv grows very slowly, while hist_ratio drops
significantly and starts growing back only after around stats_target =
1700 in this case.  And in order to reach a hist_ratio of 100%, one would
have to analyze the whole table as a "sample", which implies a statistics
target of around 6000.

If I do the same test with the patched version, the picture is
dramatically different:

stats_target | num_mcv | mcv_frac | num_hist | hist_ratio
--------------+---------+----------+----------+------------
100 | 100 | 0.849367 | 101 | 0.881188
200 | 200 | 0.964367 | 187 | 0.390374
300 | 294 | 0.998445 | 140 | 1
400 | 307 | 0.998334 | 200 | 1
500 | 329 | 0.998594 | 211 | 1

The MCV list tends to take all the available space in this case, and
hist_ratio reaches 100% (after a short drop) with only a small increase in
stats_target compared to the unpatched version.  Even if only for this
particular distribution pattern, that constitutes an improvement, I
believe.

By increasing default_statistics_target on a patched version, I could
verify that the number of instances with duplicates in the histogram due
to full MCV lists, which was 1095 at target 100 (see the last query in the
previous section), can be reduced further: down to ~650 at target 500,
then ~300 at target 1000.  Apparently there will always be distributions
that cannot be covered by increasing the stats target, but given that the
histogram fraction also decreases rather dramatically, this should not
bother us a lot.

Expected effects on plan stability
==================================

We didn't have a chance to test this change in production, but according
to research by my colleague Stefan Litsche, which is summarized in his
blog post[1] (and was also presented at PGConf.de last week), with the
existing code increasing the stats target doesn't always help immediately:
in fact, for certain distributions it leads to less accurate statistics at
first.

If you look at the last graph in that blog post, you can see that for very
rare values one needs to increase the statistics target really high in
order to get the rare value reliably covered by statistics.  Attached is
the same graph produced from a patched version of the server: now the
average number of MCVs for the discussed test distribution increases
monotonically with the statistics target.  Also, for this particular case
the statistics target can be increased by a much smaller factor in order
to get good sample coverage.

A few words about the patch
===========================

+                     /* estimate # of occurrences in sample of a typical value */
+                     avgcount = (double) sample_cnt / (double) ndistinct;

Here, ndistinct is the int-type variable, as opposed to the original code,
where it is re-declared at an inner scope with type double and hides the
one from the outer scope.  Both sample_cnt and ndistinct decrease with
every iteration of the loop, so this average calculation is actually
accurate: it's not an estimate.

+
+                     /* set minimum threshold count to store a value */
+                     mincount = 1.25 * avgcount;

I'm not fond of arbitrary magic coefficients, so I would drop this
altogether, but it is kept here to make the difference less striking
compared to the original code.

I've removed the "mincount < 2" check that was present in the original
code, because track[i].count is always >= 2, so it doesn't make a
difference if we bump it up to 2 here.

+
+                     /* don't let threshold exceed 1/K, however */
+                     maxmincount = (sample_cnt - 1) / (double) (num_hist - 1);

Finally, this caps the threshold: we cannot set it any higher.  As an
illustration, consider the following sample (assuming the statistics
target is > 1):

[A A A A A B B B B B C]

sample_cnt = 11
ndistinct = 3
num_hist = 3

Then maxmincount calculates to (11 - 1) / (3 - 1) = 10 / 2.0 = 5.0.

If we required the next most common value to be even a tiny bit more
popular than this threshold, we would not take the A value into the MCV
list, which we clearly should.

The "take all MCVs" condition
=============================

Finally, the conditions for including all tracked duplicate samples in the
MCV list look not all that easy to hit:

- there must be no other non-duplicate values (track_cnt == ndistinct)
- AND there must be no too-wide values, not even a single byte too wide
(toowide_cnt == 0)
- AND the estimated total number of distinct values must not exceed 10%
of the total rows in the table (stats->stadistinct > 0)

The last condition (track_cnt <= num_mcv) is duplicated from
compute_distinct_stats() (former compute_minimal_stats()), but the
condition always holds in compute_scalar_stats(), since we never increment
track_cnt past num_mcv in this variant of the function.

Each of the above conditions introduces a bifurcation point where one of
two very different code paths can be taken.  Even though the conditions
are strict, with the proposed patch there is no need to relax them in
order to achieve a better sample value distribution, because the complex
code path now has more predictable behavior.

In conclusion: this code has not been touched for almost 15 years[2] and
it might be scary to change anything, but I believe the evidence presented
above makes a pretty good argument for attempting to improve it.

Thanks for reading, your comments are welcome!

--
Alex

[1]: https://tech.zalando.com/blog/analyzing-extreme-distributions-in-postgresql/

[2]: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b67fc0079cf1f8db03aaa6d16f0ab8bd5d1a240d

b67fc0079cf1f8db03aaa6d16f0ab8bd5d1a240d
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Jun 6 21:29:17 2001 +0000

Be a little smarter about deciding how many most-common values to save.

[3]: To include per-attribute stattarget, replace the reference to the
pg_stats view with the following CTE:

WITH stats0 AS (
SELECT s.*,
(CASE a.attstattarget
WHEN -1 THEN current_setting('default_statistics_target')::int
ELSE a.attstattarget END) AS stats_target,

-- from pg_stats view
n.nspname AS schemaname,
c.relname AS tablename,
a.attname,
s.stainherit AS inherited,
s.stanullfrac AS null_frac,
s.stadistinct AS n_distinct,
CASE
WHEN s.stakind1 = 1 THEN s.stavalues1
WHEN s.stakind2 = 1 THEN s.stavalues2
WHEN s.stakind3 = 1 THEN s.stavalues3
WHEN s.stakind4 = 1 THEN s.stavalues4
WHEN s.stakind5 = 1 THEN s.stavalues5
ELSE NULL::anyarray
END AS most_common_vals,
CASE
WHEN s.stakind1 = 1 THEN s.stanumbers1
WHEN s.stakind2 = 1 THEN s.stanumbers2
WHEN s.stakind3 = 1 THEN s.stanumbers3
WHEN s.stakind4 = 1 THEN s.stanumbers4
WHEN s.stakind5 = 1 THEN s.stanumbers5
ELSE NULL::real[]
END AS most_common_freqs,
CASE
WHEN s.stakind1 = 2 THEN s.stavalues1
WHEN s.stakind2 = 2 THEN s.stavalues2
WHEN s.stakind3 = 2 THEN s.stavalues3
WHEN s.stakind4 = 2 THEN s.stavalues4
WHEN s.stakind5 = 2 THEN s.stavalues5
ELSE NULL::anyarray
END AS histogram_bounds

FROM stats_dump_original.pg_statistic AS s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace

WHERE NOT a.attisdropped
),

Attachments:

analyze-better-histogram.patch (text/x-patch; charset=US-ASCII) +56/-40
MCV-distribution-patched.png (image/png)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#1)
Re: More stable query plans via more predictable column statistics

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#2)
Re: More stable query plans via more predictable column statistics

On Tue, Dec 1, 2015 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

Added: https://commitfest.postgresql.org/8/434/

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#1)
Re: More stable query plans via more predictable column statistics

On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

Hi Hackers!

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Introduction
============

We have observed that for certain data sets the distribution of samples
between most_common_vals and histogram_bounds can be unstable: so that it
may change dramatically with the next ANALYZE run, thus leading to radically
different plans.

I was revisiting the following performance thread and I've found some
interesting details about statistics in our environment:

/messages/by-id/CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com

My initial interest was in evaluating whether the distribution of samples
could be made more predictable and less dependent on the factor of luck,
thus leading to more stable execution plans.

Unexpected findings
===================

What I have found is that in a significant percentage of instances, when a
duplicate sample value is *not* put into the MCV list, it does produce
duplicates in the histogram_bounds, so it looks like the MCV cut-off happens
too early, even though we have enough space for more values in the MCV list.

In the extreme cases I've found completely empty MCV lists and histograms
full of duplicates at the same time, with only about 20% of distinct values
in the histogram (as it turns out, this happens due to high fraction of
NULLs in the sample).

Wow, this is very interesting work. Using values_cnt rather than
samplerows to compute avgcount seems like a clear improvement. It
doesn't make any sense to raise the threshold for creating an MCV
based on the presence of additional nulls or too-wide values in the
table. I bet compute_distinct_stats needs a similar fix. But for
plan stability considerations, I'd say we should back-patch this all
the way, but those considerations might mitigate for a more restrained
approach. Still, maybe we should try to sneak at least this much into
9.5 RSN, because I have to think this is going to help people with
mostly-NULL (or mostly-really-wide) columns.

As far as the rest of the fix, your code seems to remove the handling
for ndistinct < 0. That seems unlikely to be correct, although it's
possible that I am missing something. Aside from that, the rest of
this seems like a policy change, and I'm not totally sure off-hand
whether it's the right policy. Having more MCVs can increase planning
time noticeably, and the point of the existing cutoff is to prevent us
from choosing MCVs that aren't actually "C". I think this change
significantly undermines those protections. It seems to me that it
might be useful to evaluate the effects of this part of the patch
separately from the samplerows -> values_cnt change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: More stable query plans via more predictable column statistics

Robert Haas <robertmhaas@gmail.com> writes:

Still, maybe we should try to sneak at least this much into
9.5 RSN, because I have to think this is going to help people with
mostly-NULL (or mostly-really-wide) columns.

Please no. We are trying to get to release, not destabilize things.

I think this is fine work for leisurely review and incorporation into
9.6. It's not appropriate to rush it into 9.5 at the RC stage after
minimal review.

regards, tom lane


#6 Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Robert Haas (#4)
Re: More stable query plans via more predictable column statistics

On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

What I have found is that in a significant percentage of instances, when a
duplicate sample value is *not* put into the MCV list, it does produce
duplicates in the histogram_bounds, so it looks like the MCV cut-off happens
too early, even though we have enough space for more values in the MCV list.

In the extreme cases I've found completely empty MCV lists and histograms
full of duplicates at the same time, with only about 20% of distinct values
in the histogram (as it turns out, this happens due to high fraction of
NULLs in the sample).

Wow, this is very interesting work. Using values_cnt rather than
samplerows to compute avgcount seems like a clear improvement. It
doesn't make any sense to raise the threshold for creating an MCV
based on the presence of additional nulls or too-wide values in the
table. I bet compute_distinct_stats needs a similar fix.

Yes, and there's also the magic 1.25 multiplier in that code.  I think it
would make sense to agree first on how exactly the patch for
compute_scalar_stats() should look, then port the relevant bits to
compute_distinct_stats().

But for
plan stability considerations, I'd say we should back-patch this all
the way, but those considerations might argue for a more restrained
approach. Still, maybe we should try to sneak at least this much into
9.5 RSN, because I have to think this is going to help people with
mostly-NULL (or mostly-really-wide) columns.

I'm not sure.  People would likely have complained or found this out on
their own if they were seriously affected.

What I would be interested in is people running the queries I've shown on
their data to see if there are any interesting/unexpected patterns.

As far as the rest of the fix, your code seems to remove the handling
for ndistinct < 0.  That seems unlikely to be correct, although it's
possible that I am missing something.

The difference here is that ndistinct at this scope in the original code
did hide a variable from an outer scope.  That one could be < 0, but in my
code there is no inner-scope ndistinct; we are referring to the
outer-scope variable, which cannot be < 0.

Aside from that, the rest of
this seems like a policy change, and I'm not totally sure off-hand
whether it's the right policy. Having more MCVs can increase planning
time noticeably, and the point of the existing cutoff is to prevent us
from choosing MCVs that aren't actually "C". I think this change
significantly undermines those protections. It seems to me that it
might be useful to evaluate the effects of this part of the patch
separately from the samplerows -> values_cnt change.

Yes, that's why I was wondering if a frequency cut-off approach might be
helpful here.  I'm going to have a deeper look at the array typanalyze
implementation at the least.

--
Alex

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: More stable query plans via more predictable column statistics

On Fri, Dec 4, 2015 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Still, maybe we should try to sneak at least this much into
9.5 RSN, because I have to think this is going to help people with
mostly-NULL (or mostly-really-wide) columns.

Please no. We are trying to get to release, not destabilize things.

Well, OK, but I don't really see how that particular bit is anything
other than a bug fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8 Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#3)
Re: More stable query plans via more predictable column statistics

On Wed, Dec 2, 2015 at 10:20 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Tue, Dec 1, 2015 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

Added: https://commitfest.postgresql.org/8/434/

It would be great if some folks could find a moment to run the queries I
was showing on their data to confirm (or refute) my findings, or to
contribute to the picture in general.

As I was saying, the queries were designed in such a way that even an
unprivileged user can run them (the results will be limited to the stats
data available to that user, obviously; and for custom-tailored stats
targets one still needs superuser to join the pg_statistic table
directly).  Also, at the scale of ~30k attribute statistics records, the
queries take only a few seconds to finish.

Cheers!
--
Alex

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: More stable query plans via more predictable column statistics

Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

Tom, are you reviewing this for the current commitfest?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#9)
Re: More stable query plans via more predictable column statistics

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

Tom, are you reviewing this for the current commitfest?

Um, I would like to review it, but I doubt I'll find time before the end
of the month.

regards, tom lane


#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
Re: More stable query plans via more predictable column statistics

Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:

Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

This post summarizes a few weeks of research of ANALYZE statistics
distribution on one of our bigger production databases with some real-world
data and proposes a patch to rectify some of the oddities observed.

Please add this to the 2016-01 commitfest ...

Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently,
so let me post the review here ...

Firstly, I'd like to express my appreciation for the level of detail in
the analysis.  I may disagree with some of the conclusions, but I wish all
my patch submissions were of such high quality.

Regarding the patch itself, I think there's a few different points
raised, so let me discuss them one by one:

1) NULLs vs. MCV threshold
--------------------------

I agree that this seems like a bug, and that we should really compute
the threshold only using non-NULL values. I think the analysis rather
conclusively proves this, and I also think there are places where we make
the same mistake (more on that at the end of the review).

2) mincount = 1.25 * avgcount
-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here), I
do think we better keep this, otherwise we can just get rid of the
mincount entirely I think - the most frequent value will always be above
the (remaining) average count, making the threshold pointless.

It might have an impact in the original code, but in the new one it's quite
useless (see the next point), unless I'm missing some detail.

3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

* We also decrease ndistinct in the process such that going forward
* it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

/* Narrow our view of samples left for the histogram */
sample_cnt -= track[i].count;
ndistinct--;

but that immediately lowers the avgcount, as that's re-evaluated within
the same loop

avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold,
although that's partially mitigated by keeping the 1.25 coefficient.

It however makes reasoning about the algorithm much more complicated.
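[Editorial illustration: the shrinking threshold described in point 3 can be simulated in a few lines of Python. The per-value counts below are invented, and this is only a sketch of the loop's arithmetic, not the analyze.c code itself.]

```python
# Per-iteration MCV threshold from the patched loop, with made-up counts.
track = [50, 30, 20, 10, 5, 5]   # hypothetical per-value sample counts
sample_cnt = sum(track)          # 120 non-NULL sample values
ndistinct = len(track)           # 6 distinct values

thresholds = []
for count in track:
    avgcount = sample_cnt / ndistinct
    mincount = 1.25 * avgcount   # the arbitrary coefficient under discussion
    thresholds.append(mincount)
    # "Narrow our view of samples left for the histogram"
    sample_cnt -= count
    ndistinct -= 1

# Each value moved into the MCV list lowers (or at best preserves) the
# threshold faced by the next candidate: 25.0, 17.5, 12.5, ...
print(thresholds)
```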

4) for (i = 0; /* i < num_mcv */; i++)
---------------------------------------

The way the loop is coded seems rather awkward, I guess. Not only is
there an unexpected comment in the "for" clause, but the condition
also says this

/* Another way to say "while (i < num_mcv)" */
if (i >= num_mcv)
break;

Why not write it as a while loop, then? Possibly including the
(num_hist >= 2) condition, like this:

while ((i < num_mcv) && (num_hist >= 2))
{
...
}

In any case, the loop would deserve a comment explaining why we think
computing the thresholds like this makes sense.

Summary
-------

Overall, I think this is really about deciding when to cut off the MCV list,
so that it does not grow needlessly large - as Robert pointed out, the
larger the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we
believe we've seen all the values and we can fit them into the MCV
list), it may not make sense to do so. The ultimate goal is to estimate
conditions, and if we can do that reasonably even after cutting off the
least frequent values from the MCV list, then why not?

From this point of view, the analysis deals just with the
ANALYZE part and does not discuss the estimation counter-part at all.

5) ndistinct estimation vs. NULLs
---------------------------------

While looking at the patch, I started wondering whether we're actually
handling NULLs correctly when estimating ndistinct, because that part
also uses samplerows directly and entirely ignores NULLs, as it does this:

numer = (double) samplerows * (double) d;

denom = (double) (samplerows - f1) +
(double) f1 * (double) samplerows / totalrows;

...
if (stadistinct > totalrows)
stadistinct = totalrows;

For tables with large fraction of NULLs, this seems to significantly
underestimate the ndistinct value - for example consider a trivial table
with 95% of NULL values and ~10k distinct values with skewed distribution:

create table t (id int);

insert into t
select (case when random() < 0.05 then (10000 * random() * random())
else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in my sample:

test=# select count(distinct id) from t;
count
-------
8325
(1 row)

But after ANALYZE with default statistics target (100), ndistinct is
estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider fraction of NULLs, the estimates
look like this:

statistics target | master | patched
------------------+--------+--------
              100 |   1302 |    5356
             1000 |   6022 |    6791

So this seems to significantly improve the ndistinct estimate (patch
attached).
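[Editorial note: the estimator in question is the Duj1 estimator of Haas and Stokes, which analyze.c computes as the numer/denom expression above, clamped to totalrows. The following Python sketch uses invented numbers loosely mirroring the 95%-NULL example to show why feeding it non-NULL counts instead of raw row counts raises the estimate; it is an illustration, not the patch itself.]

```python
def duj1(n, N, d, f1):
    """Haas-Stokes Duj1 estimator: n*d / (n - f1 + f1*n/N), clamped to N.
    n = sampled rows, N = total rows, d = distinct values in the sample,
    f1 = values seen exactly once in the sample."""
    return min(n * d / ((n - f1) + f1 * n / N), N)

# Invented numbers loosely mirroring the 95%-NULL example:
samplerows, totalrows = 30_000, 1_000_000      # raw counts incl. NULLs
nonnull_rows, nonnull_total = 1_500, 50_000    # ~5% of rows are non-NULL
d, f1 = 1_200, 1_000

master = duj1(samplerows, totalrows, d, f1)    # NULLs counted as rows
patched = duj1(nonnull_rows, nonnull_total, d, f1)
print(round(master), round(patched))           # 1240 3396
```

With NULLs counted, f1 is tiny relative to n, so the sample looks "saturated" and the estimate collapses toward d; restricting to non-NULL rows restores the intended weighting.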

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

ndistinct-nulls-fix.patch (application/x-patch; +9/-5)
#12Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#11)
Re: More stable query plans via more predictable column statistics

On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com

wrote:

Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:

Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently, so
let me post the review here ...

Thank you for the review!

2) mincount = 1.25 * avgcount

-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here), I
do think we better keep this, otherwise we can just get rid of the mincount
entirely I think - the most frequent value will always be above the
(remaining) average count, making the threshold pointless.

Correct.

It might have an impact in the original code, but in the new one it's quite

useless (see the next point), unless I'm missing some detail.

3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

* We also decrease ndistinct in the process such that going forward
* it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

/* Narrow our view of samples left for the histogram */
sample_cnt -= track[i].count;
ndistinct--;

but that immediately lowers the avgcount, as that's re-evaluated within
the same loop

avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold,
although that's partially mitigated by keeping the 1.25 coefficient.

I was going to write "not necessarily lowering", but this is actually
accurate. The following holds due to track[i].count > avgcount (=
sample_cnt / ndistinct):

 sample_cnt       sample_cnt - track[i].count
------------  >  -----------------------------
 ndistinct             ndistinct - 1
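[Editorial spot-check: the inequality holds for any count strictly above the current average. This Python sketch verifies it with exact rational arithmetic over random cases; it is an illustration, not part of the patch.]

```python
from fractions import Fraction
import random

random.seed(1)
checked = 0
for _ in range(1000):
    ndistinct = random.randint(2, 50)
    sample_cnt = random.randint(ndistinct, 10_000)
    avgcount = Fraction(sample_cnt, ndistinct)
    # pick any track[i].count strictly above the current average...
    count = int(avgcount) + random.randint(1, 100)
    # ...then the remaining average is strictly below the old one
    assert Fraction(sample_cnt - count, ndistinct - 1) < avgcount
    checked += 1
```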

It however makes reasoning about the algorithm much more complicated.

Unfortunately, yes.

4) for (i = 0; /* i < num_mcv */; i++)

---------------------------------------

The way the loop is coded seems rather awkward, I guess. Not only is there
an unexpected comment in the "for" clause, but the condition also says this

/* Another way to say "while (i < num_mcv)" */
if (i >= num_mcv)
break;

Why not write it as a while loop, then? Possibly including the
(num_hist >= 2) condition, like this:

while ((i < num_mcv) && (num_hist >= 2))
{
...
}

In any case, the loop would deserve a comment explaining why we think
computing the thresholds like this makes sense.

This is partially explained by a comment inside the loop:

! for (i = 0; /* i < num_mcv */; i++)
{
! /*
! * We have to put this before the loop condition, otherwise
! * we'll have to repeat this code before the loop and after
! * decreasing ndistinct.
! */
! num_hist = ndistinct;
! if (num_hist > num_bins)
! num_hist = num_bins + 1;

I guess this is a case where code duplication can be traded for more
apparent control flow, i.e.:

! num_hist = ndistinct;
! if (num_hist > num_bins)
! num_hist = num_bins + 1;

!  for (i = 0; i < num_mcv && num_hist >= 2; i++)
   {
...
+ /* Narrow our view of samples left for the histogram */
+ sample_cnt -= track[i].count;
+ ndistinct--;
+
+  num_hist = ndistinct;
+  if (num_hist > num_bins)
+  num_hist = num_bins + 1;
   }

Summary

-------

Overall, I think this is really about deciding when to cut off the MCV list, so
that it does not grow needlessly large - as Robert pointed out, the larger
the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we
believe we've seen all the values and we can fit them into the MCV list),
it may not make sense to do so. The ultimate goal is to estimate
conditions, and if we can do that reasonably even after cutting off the
least frequent values from the MCV list, then why not?

From this point of view, the analysis deals just with the
ANALYZE part and does not discuss the estimation counter-part at all.

True, this aspect still needs verification. As stated, my primary
motivation was to improve the plan stability for relatively short MCV lists.

Longer MCV lists might be a different story, but see "Increasing stats
target" section of the original mail: increasing the target doesn't give
quite the expected results with unpatched code either.

5) ndistinct estimation vs. NULLs

---------------------------------

While looking at the patch, I started wondering whether we're actually
handling NULLs correctly when estimating ndistinct, because that part also
uses samplerows directly and entirely ignores NULLs, as it does this:

numer = (double) samplerows * (double) d;

denom = (double) (samplerows - f1) +
(double) f1 * (double) samplerows / totalrows;

...
if (stadistinct > totalrows)
stadistinct = totalrows;

For tables with large fraction of NULLs, this seems to significantly
underestimate the ndistinct value - for example consider a trivial table
with 95% of NULL values and ~10k distinct values with skewed distribution:

create table t (id int);

insert into t
select (case when random() < 0.05 then (10000 * random() * random())
else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in my sample:

test=# select count(distinct id) from t;
count
-------
8325
(1 row)

But after ANALYZE with default statistics target (100), ndistinct is
estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider fraction of NULLs, the estimates
look like this:

statistics target | master | patched
------------------+--------+--------
              100 |   1302 |    5356
             1000 |   6022 |    6791

So this seems to significantly improve the ndistinct estimate (patch
attached).

Hm... this looks correct. And compute_distinct_stats() needs the same
treatment, obviously.

--
Alex

#13Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#12)
Re: More stable query plans via more predictable column statistics

On Mon, Jan 25, 2016 at 5:11 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <

tomas.vondra@2ndquadrant.com> wrote:

Overall, I think this is really about deciding when to cut-off the MCV,

so that it does not grow needlessly large - as Robert pointed out, the
larger the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we

believe we've seen all the values and we can fit them into the MCV list),
it may not make sense to do so. The ultimate goal is to estimate
conditions, and if we can do that reasonably even after cutting of the
least frequent values from the MCV list, then why not?

From this point of view, the analysis deals just with the

ANALYZE part and does not discuss the estimation counter-part at all.

True, this aspect still needs verification. As stated, my primary

motivation was to improve the plan stability for relatively short MCV lists.

Longer MCV lists might be a different story, but see "Increasing stats

target" section of the original mail: increasing the target doesn't give
quite the expected results with unpatched code either.

To address this concern I've run my queries again on the same dataset, now
focusing on how the number of MCV items changes with the patched code
(using the CTEs from my original mail):

WITH ...

SELECT count(1),
min(num_mcv)::real,
avg(num_mcv)::real,
max(num_mcv)::real,
stddev(num_mcv)::real

FROM stats2

WHERE num_mcv IS NOT NULL;

(ORIGINAL)
count | 27452
min | 1
avg | 32.7115
max | 100
stddev | 40.6927

(PATCHED)
count | 27527
min | 1
avg | 38.4341
max | 100
stddev | 43.3596

A significant portion of the MCV lists is occupying all 100 slots available
with the default statistics target, so it is also interesting to look at the
stats that have "underfilled" MCV lists (by changing the condition of the
WHERE clause to read "num_mcv < 100"):

(<100 ORIGINAL)
count | 20980
min | 1
avg | 11.9541
max | 99
stddev | 18.4132

(<100 PATCHED)
count | 19329
min | 1
avg | 12.3222
max | 99
stddev | 19.6959

As one can see, with the patched code the average length of MCV lists
doesn't change all that dramatically, while at the same time exposing all
the improvements described in the original mail.

After fixing the estimator to consider fraction of NULLs, the estimates

look like this:

statistics target | master | patched
------------------+--------+--------
              100 |   1302 |    5356
             1000 |   6022 |    6791

So this seems to significantly improve the ndistinct estimate (patch

attached).

Hm... this looks correct. And compute_distinct_stats() needs the same

treatment, obviously.

I've incorporated this fix into the v2 of my patch, I think it is related
closely enough. Also, added corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.

I'm adding this to the next CommitFest. Further reviews are very much
appreciated!

--
Alex

Attachments:

analyze-better-histogram-v2.patch (text/x-patch; +105/-90)
#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#13)
Re: More stable query plans via more predictable column statistics

Hi,

On 02/08/2016 03:01 PM, Shulgin, Oleksandr wrote:

...

I've incorporated this fix into the v2 of my patch, I think it is
related closely enough. Also, added corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.

I think it'd be useful not to have all the changes in one lump, but
structure this as a patch series with related changes in separate
chunks. I doubt we'd like to mix the changes in a single commit, and it
makes the reviews and reasoning easier. So those NULL-handling fixes
should be in one patch, the MCV patches in another one.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#15Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#14)
Re: More stable query plans via more predictable column statistics

On Wed, Feb 24, 2016 at 12:30 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com

wrote:

Hi,

On 02/08/2016 03:01 PM, Shulgin, Oleksandr wrote:

...

I've incorporated this fix into the v2 of my patch, I think it is
related closely enough. Also, added corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.

I think it'd be useful not to have all the changes in one lump, but
structure this as a patch series with related changes in separate chunks. I
doubt we'd like to mix the changes in a single commit, and it makes the
reviews and reasoning easier. So those NULL-handling fixes should be in one
patch, the MCV patches in another one.

OK, such a split would make sense to me. Though I'm a bit late, as the
commitfest is already closed to new patches; I guess asking the CF manager
to split this might work (assuming I produce the patch files)?

--
Alex

#16David Steele
david@pgmasters.net
In reply to: Shulgin, Oleksandr (#15)
Re: More stable query plans via more predictable column statistics

On 3/2/16 11:10 AM, Shulgin, Oleksandr wrote:

On Wed, Feb 24, 2016 at 12:30 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:

I think it'd be useful not to have all the changes in one lump, but
structure this as a patch series with related changes in separate
chunks. I doubt we'd like to mix the changes in a single commit, and
it makes the reviews and reasoning easier. So those NULL-handling
fixes should be in one patch, the MCV patches in another one.

OK, such a split would make sense to me. Though I'm a bit late, as the
commitfest is already closed to new patches; I guess asking the CF
manager to split this might work (assuming I produce the patch files)?

If the patch is broken into two files that gives the reviewer/committer
more options, but I don't think it requires another CF entry.

--
-David
david@pgmasters.net

#17Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: David Steele (#16)
Re: More stable query plans via more predictable column statistics

On Wed, Mar 2, 2016 at 5:46 PM, David Steele <david@pgmasters.net> wrote:

On 3/2/16 11:10 AM, Shulgin, Oleksandr wrote:

On Wed, Feb 24, 2016 at 12:30 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>

wrote:

I think it'd be useful not to have all the changes in one lump, but
structure this as a patch series with related changes in separate
chunks. I doubt we'd like to mix the changes in a single commit, and
it makes the reviews and reasoning easier. So those NULL-handling
fixes should be in one patch, the MCV patches in another one.

OK, such a split would make sense to me. Though, I'm a bit late as the
commitfest is already closed to new patches, I guess asking the CF
manager to split this might work (assuming I produce the patch files)?

If the patch is broken into two files that gives the reviewer/committer
more options, but I don't think it requires another CF entry.

Alright. I'm attaching the latest version of this patch split in two
parts: the first one is NULLs-related bugfix and the second is the
"improvement" part, which applies on top of the first one.

--
Alex

Attachments:

0001-Account-for-NULLs-in-ANALYZE-more-strictly.patch (text/x-patch; +24/-19)
0002-Try-to-account-for-skewed-distributions-in-ANALYZE.patch (text/x-patch; +62/-29)
#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#17)
Re: More stable query plans via more predictable column statistics

Shulgin, Oleksandr wrote:

Alright. I'm attaching the latest version of this patch split in two
parts: the first one is NULLs-related bugfix and the second is the
"improvement" part, which applies on top of the first one.

So is this null-related bugfix supposed to be backpatched? (I assume
it's not, because it's very likely to change existing plans.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Alvaro Herrera (#18)
Re: More stable query plans via more predictable column statistics

On Wed, Mar 2, 2016 at 7:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Shulgin, Oleksandr wrote:

Alright. I'm attaching the latest version of this patch split in two
parts: the first one is NULLs-related bugfix and the second is the
"improvement" part, which applies on top of the first one.

So is this null-related bugfix supposed to be backpatched? (I assume
it's not, because it's very likely to change existing plans.)

For the better, because cardinality estimates will be more accurate in
these cases, so yes, I would expect it to be back-patchable.

--
Alex

#20Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#19)
Re: More stable query plans via more predictable column statistics

On Thu, Mar 3, 2016 at 2:48 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

On Wed, Mar 2, 2016 at 7:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Shulgin, Oleksandr wrote:

Alright. I'm attaching the latest version of this patch split in two
parts: the first one is NULLs-related bugfix and the second is the
"improvement" part, which applies on top of the first one.

So is this null-related bugfix supposed to be backpatched? (I assume
it's not, because it's very likely to change existing plans.)

For the better, because cardinality estimates will be more accurate in these
cases, so yes, I would expect it to be back-patchable.

-1. I think the cost of changing existing query plans in back
branches is too high. The people who get a better plan never thank
us, but the people who (by bad luck) get a worse plan always complain.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Robert Haas (#20)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#21)
#23Jeff Janes
jeff.janes@gmail.com
In reply to: Shulgin, Oleksandr (#21)
#24Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Jeff Janes (#23)
#25Joel Jacobson
joel@trustly.com
In reply to: Shulgin, Oleksandr (#8)
#26Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Joel Jacobson (#25)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#17)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
#29Joel Jacobson
joel@trustly.com
In reply to: Shulgin, Oleksandr (#26)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#28)
#31Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Joel Jacobson (#29)
#32Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Alvaro Herrera (#27)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#31)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#32)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#34)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#36)
#38Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#34)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#38)
#40Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#39)
#41Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#41)
#43Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#42)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#17)
#45Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#44)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#45)
#47Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#46)
#48Alex Shulgin
alex.shulgin@gmail.com
In reply to: Shulgin, Oleksandr (#47)
#49Alex Shulgin
alex.shulgin@gmail.com
In reply to: Alex Shulgin (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#49)
#51Alex Shulgin
alex.shulgin@gmail.com
In reply to: Tom Lane (#50)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#51)
#53Alex Shulgin
alex.shulgin@gmail.com
In reply to: Tom Lane (#52)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#53)
#55Alex Shulgin
alex.shulgin@gmail.com
In reply to: Tom Lane (#54)
#56Alex Shulgin
alex.shulgin@gmail.com
In reply to: Alex Shulgin (#53)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#56)
#58Alex Shulgin
alex.shulgin@gmail.com
In reply to: Tom Lane (#57)
#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#58)
#60Alex Shulgin
alex.shulgin@gmail.com
In reply to: Tom Lane (#59)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Shulgin (#60)
#62Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#61)