MCV lists for highly skewed distributions
I want to revive a patch I sent a couple of years ago to the performance list,
as I have seen the issue pop up repeatedly since then.
Original thread is here:
/messages/by-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ=_ME1aaf4vOmnnA@mail.gmail.com
The problem is that if you have a highly skewed distribution, such as
exponential frequencies, it usually stores too few entries in the MCV list.
For example:
create table foo as select floor(-log(random())/log(2))::int as who from
generate_series(1,10000000);
This will usually store 0 to 5 as MCV with the default stat target [1].
Then value 6 is not stored, and its ~75k repetitions get averaged over the
remaining distinct values. This leads to horrible misplanning for rare
values. For example, who=17 actually has 37 rows, but is estimated to have
20,000. The join plan that is correct for 37 rows is often not the correct
one for 20,000.
If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.
So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.
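To put rough numbers on that (the 98.4% coverage of values 0 to 5 and the
sample-based ndistinct estimate of about 14 are back-of-the-envelope guesses,
not measured values): each non-MCV value is estimated at the leftover fraction
of the table divided by the leftover distinct count, which is where the
~20,000 comes from, and storing value 6 as well depletes the residual:
-- with only 0-5 in the MCV list:
select round(10000000 * (1 - 0.984) / (14 - 6)) as est_rows_per_rare_value;           -- ~20000
-- after also storing value 6 (~0.75% of rows) as an MCV:
select round(10000000 * (1 - 0.984 - 0.0075) / (14 - 7)) as est_rows_per_rare_value;  -- ~12000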
As it is, you can partially overcome the too-short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a
high cost of CPU time. In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.
I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.
[1] Occasionally it will store a much longer MCV list, because no value
was sampled exactly once, which triggers a different code path in which all
seen values are put in the MCV and the histogram is NULL. This is not
reliable, as whether the least-sampled value is present in the sample once
or twice is pretty brittle.
Cheers,
Jeff
Attachments:
analyze_highly_skewed.patch (text/x-patch)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f952b3c..b02ac20 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2575,7 +2575,8 @@ compute_scalar_stats(VacAttrStatsP stats,
* the MCV list is not complete, it's generally worth being more
* selective, and not just filling it all the way up to the stats
* target. So for an incomplete list, we try to take only MCVs that
- * are significantly more common than average.
+ * are significantly more common than average frequency for values which
+ * will not be represented in the MCV list.
*/
if (track_cnt == ndistinct && toowide_cnt == 0 &&
stats->stadistinct > 0 &&
@@ -2587,6 +2588,7 @@ compute_scalar_stats(VacAttrStatsP stats,
else
{
double ndistinct_table = stats->stadistinct;
+ double values_cnt_remaining = (double) values_cnt;
double avgcount,
mincount,
maxmincount;
@@ -2594,25 +2596,27 @@ compute_scalar_stats(VacAttrStatsP stats,
/* Re-extract estimate of # distinct nonnull values in table */
if (ndistinct_table < 0)
ndistinct_table = -ndistinct_table * totalrows;
- /* estimate # occurrences in sample of a typical nonnull value */
- avgcount = (double) nonnull_cnt / ndistinct_table;
- /* set minimum threshold count to store a value */
- mincount = avgcount * 1.25;
- if (mincount < 2)
- mincount = 2;
- /* don't let threshold exceed 1/K, however */
- maxmincount = (double) values_cnt / (double) num_bins;
- if (mincount > maxmincount)
- mincount = maxmincount;
- if (num_mcv > track_cnt)
- num_mcv = track_cnt;
- for (i = 0; i < num_mcv; i++)
- {
+ /* estimate # of occurrences in sample of a typical nonnull value */
+ for (i = 0; i < num_mcv; i++)
+ {
+ avgcount = values_cnt_remaining / ndistinct_table;
+ /* set minimum threshold count to store a value */
+ mincount = avgcount * 1.25;
+ if (mincount < 2)
+ mincount = 2;
+ /* don't let threshold exceed 1/K, however */
+ maxmincount = (double) values_cnt / (double) (num_bins);
+ if (mincount > maxmincount)
+ mincount = maxmincount;
+ if (num_mcv > track_cnt)
+ num_mcv = track_cnt;
if (track[i].count < mincount)
{
num_mcv = i;
break;
}
+ values_cnt_remaining -= track[i].count;
+ ndistinct_table--;
}
}
On 12/28/17, Jeff Janes <jeff.janes@gmail.com> wrote:
I want to revive a patch I sent a couple of years ago to the performance list,
as I have seen the issue pop up repeatedly since then.
If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.
After reading the thread you mentioned, I think that's a good approach.
On master, the numerator of avgcount is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?
Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.
This part doesn't use any loop variables, so should happen before the loop:
+ if (num_mcv > track_cnt)
+ num_mcv = track_cnt;
I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avgcount.
I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.
I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.
-John Naylor
I wrote:
On 12/28/17, Jeff Janes <jeff.janes@gmail.com> wrote:
I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've
seen
planning problems due to skewed distributions I also usually see
redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.
I did some quick and dirty testing of this, and I just want to note
that in this case, setting mincount to its hard-coded minimum must
come after setting it to maxmincount, since now maxmincount can go
arbitrarily low.
I'll be travelling for a few days, but I'll do some testing on some
data sets soon. While looking through the archives for more info, I
saw this thread
/messages/by-id/32261.1496611829@sss.pgh.pa.us
which showcases the opposite problem: For more uniform distributions,
there are too many MCVs. Not relevant to your problem, but if I have
time I'll try my hand at testing an approach suggested in that thread
at the same time I test your patch and see how it interacts.
-John Naylor
I wrote:
I'll be travelling for a few days, but I'll do some testing on some
data sets soon.
I've attached some preliminary test methods and results. Probably not
detailed or rigorous enough, but I wanted to share this much before
digging deeper. I created some test data, ran analyze a few times and
eyeballed the results, of which I give a typical example. I used a low
stat target to make it easier to see the general picture. Suggestions
welcome.
-John Naylor
Attachments:
On 28 December 2017 at 01:45, Jeff Janes <jeff.janes@gmail.com> wrote:
If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.
I agree with this thought.
So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.
It looks like this might even have been the original intention of that code.
Patch looks OK, but I think the comments need minor changes above line 2575
As it is, you can partially overcome the too short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a high
cost of CPU time. In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.
Agreed, not a good solution.
[1] Occasionally it will store a much longer MCV list, because no value was
sampled exactly once, which triggers a different code path in which all seen
values are put in the MCV and the histogram is NULL. This is not reliable,
as whether the least-sampled value is present in the sample once or twice is
pretty brittle.
And we need a better discussion of risk: before, we generated too few
MCV entries. To what extent might we now generate too many? That would
be a problem because of increased planning time.
I have a slight reservation about whether 1.25x is still a sensible
heuristic. Should we add a parameter for that to allow testing during
beta?
Marking as Ready For Committer.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[1] Occasionally it will store a much longer MCV list, because no value
was sampled exactly once, which triggers a different code path in which all
seen values are put in the MCV and the histogram is NULL. This is not
reliable, as whether the least-sampled value is present in the sample once
or twice is pretty brittle.

And we need a better discussion of risk: before, we generated too few
MCV entries. To what extent might we now generate too many? That would
be a problem because of increased planning time.
(My apologies, I just now found some time to test this further, but I
don't have additional results yet)
Simon,
Earlier, I referenced a thread [1] complaining that we currently
already have too many MCVs in the case of uniform distributions, with
worse consequences than planning time. Based on my (admittedly quick
and dirty) preliminary testing (see attachment from a couple weeks
ago), this patch exacerbates that problem, and I was hoping to find a
way to fix that.
I have a slight reservation about whether 1.25x is still a sensible
heuristic.
This was also discussed in [1], but no patch came out of it. I was
just now turning the formulas discussed there into code, but I'll
defer to someone with more expertise. FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.
[1]: /messages/by-id/32261.1496611829@sss.pgh.pa.us
-John Naylor
I wrote:
FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.
By this I meant the coefficient of variation of the class size in the
sample, as denoted by gamma in the Haas and Stokes paper on page 7.
-John Naylor
I wrote:
I have a slight reservation about whether 1.25x is still a sensible
heuristic.

This was also discussed in [1], but no patch came out of it. I was
just now turning the formulas discussed there into code, but I'll
defer to someone with more expertise. FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.
I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.
I have not been able to come up with a more compelling alternative, so
I have nothing further to say about the patch under review.
-John Naylor
On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote:
I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.
Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of
distinct values is often too low, so that might explain what you're
seeing.
It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.
This patch is marked Ready for Committer, but that seems wildly optimistic
based on the state of the discussion. It doesn't look to me like we
even have consensus on an algorithm, let alone code for it. Certainly,
whatever we do needs to address the too-many-MCVs issue from [1]
(/messages/by-id/32261.1496611829@sss.pgh.pa.us) as
well as Jeff's too-few-MCVs case.
Even if we had a plausible patch, I'm not sure how we get to the
point of having enough consensus to commit. In the previous thread,
it seemed that some people would object to any change whatsoever :-(
In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.
regards, tom lane
On 25 January 2018 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

This patch is marked Ready for Committer, but that seems wildly optimistic
based on the state of the discussion. It doesn't look to me like we
even have consensus on an algorithm, let alone code for it. Certainly,
whatever we do needs to address the too-many-MCVs issue from [1] as
well as Jeff's too-few-MCVs case.

Even if we had a plausible patch, I'm not sure how we get to the
point of having enough consensus to commit. In the previous thread,
it seemed that some people would object to any change whatsoever :-(

In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.
Dean and John's results show that different algorithms work better for
different cases.
How about we make ANALYZE's MCV algorithm pluggable? And then include,
say, 2 additional algorithms.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1 February 2018 at 13:16, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 January 2018 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.

Dean and John's results show that different algorithms work better for
different cases.

How about we make ANALYZE's MCV algorithm pluggable? And then include,
say, 2 additional algorithms.
I don't think we've yet proved that that's needed. I'd rather attempt
to come up with a decent general-purpose algorithm, if possible.
The main problem I have with the originally proposed patch is the lack
of any kind of rigorous justification for the approach of changing the
algorithm to include values that are "significantly more common than
average frequency for values which will not be represented in the MCV
list". So there's no guarantee that the MCVs produced will be backed
by sufficient evidence, and it risks making the too-many-MCVs case
worse.
Of course the current code suffers from both the too-many-MCVs and
too-few-MCVs problems, depending on the data distribution:
For a reasonably uniform distribution with quite a large number of
distinct values, it is possible to generate MCVs on the basis of
having seen only a few instances of the values. Those few values seen
are then not sufficiently statistically significant, and extrapolating
to the whole table produces wildly inaccurate estimates.
For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appears more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.
There is also another variant of the too-many-MCVs problem that I
believe is also possible -- if the sample contains a large number of
NULLs or too-wide values, values_cnt could be reduced to the point
where maxmincount is quite small, and again MCVs could get included on
the basis of a very small number of occurrences.
I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.
With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency -- so we
include values in the MCV list if and only if they are seen enough
times in the sample that the standard error on the sample frequency is
small compared to the sample frequency itself, and thus we expect that
the relative error resulting from extrapolating that to the whole
table is also small. In theory then, it should never generate too many
MCVs, although tuning of the relative error threshold might be
required to prevent it from generating too few.
Note, this is not a finished patch (e.g., I haven't touched
compute_distinct_stats()). It's merely a possible starting point from
which a lot more testing will be required.
Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).
Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. It does at least appear to no longer be the
case that cranking up the stats target has a weak effect.
Regards,
Dean
[1]: /messages/by-id/20170522132017.29944.48391@wrigleys.postgresql.org
Attachments:
mcv-stats.patch (text/x-patch)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..af92835
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2557,25 +2557,21 @@ compute_scalar_stats(VacAttrStatsP stats
* Decide how many values are worth storing as most-common values. If
* we are able to generate a complete MCV list (all the values in the
* sample will fit, and we think these are all the ones in the table),
- * then do so. Otherwise, store only those values that are
- * significantly more common than the (estimated) average. We set the
- * threshold rather arbitrarily at 25% more than average, with at
- * least 2 instances in the sample. Also, we won't suppress values
- * that have a frequency of at least 1/K where K is the intended
- * number of histogram bins; such values might otherwise cause us to
- * emit duplicate histogram bin boundaries. (We might end up with
- * duplicate histogram entries anyway, if the distribution is skewed;
- * but we prefer to treat such values as MCVs if at all possible.)
+ * then do so. Otherwise, keep only those values that appear
+ * sufficiently often in the sample that it is reasonable to
+ * extrapolate their sample frequencies to the entire table. We do
+ * this by placing an upper bound on the relative standard error of
+ * the sample frequency, so that any estimates the planner generates
+ * from these statistics can be expected to be reasonably accurate.
*
* Note: the first of these cases is meant to address columns with
* small, fixed sets of possible values, such as boolean or enum
* columns. If we can *completely* represent the column population by
* an MCV list that will fit into the stats target, then we should do
* so and thus provide the planner with complete information. But if
- * the MCV list is not complete, it's generally worth being more
- * selective, and not just filling it all the way up to the stats
- * target. So for an incomplete list, we try to take only MCVs that
- * are significantly more common than average.
+ * the MCV list is not complete, then we need to be more selective, to
+ * avoid including values that aren't common enough in the sample to
+ * generate accurate statistics for the population.
*/
if (track_cnt == ndistinct && toowide_cnt == 0 &&
stats->stadistinct > 0 &&
@@ -2586,24 +2582,39 @@ compute_scalar_stats(VacAttrStatsP stats
}
else
{
- double ndistinct_table = stats->stadistinct;
- double avgcount,
- mincount,
- maxmincount;
+ /*----------
+ * Discard values whose relative standard error is too high. A
+ * common rule of thumb when estimating errors in this situation
+ * is to require at least 10 instances in the sample. This is
+ * sufficient to allow the distribution of the sample frequency (a
+ * hypergeometric distribution, since we are doing sampling
+ * without replacement) to be approximated by a normal
+ * distribution, and standard error analysis techniques can be
+ * applied. Then, if the sample size is n, the population size is
+ * N, and the sample frequency is p=cnt/n, the standard error on p
+ * is given by
+ * SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+ * where the second term is the finite population correction. We
+ * impose an (arbitrarily chosen) upper bound on the relative
+ * standard error of 10% -- i.e., SE/p < 0.1. This gives a lower
+ * bound on the number of instances of the value seen:
+ * cnt > n*(N-n) / (N-n+0.01*n*(N-1))
+ * This bound is at most 100, and approaches 0 as n approaches 0
+ * or N. The case where n approaches 0 isn't actually possible,
+ * since the sample size is at least 300. The case where n
+ * approaches N corresponds to sampling most of the table, in
+ * which case it is reasonable to keep the whole MCV list, as we
+ * do above. Thus it is reasonable to apply this bound for all
+ * inputs (even though the formula is technically only valid when
+ * the right hand side is at least around 10), giving a smooth
+ * transition from this code branch to the all-values-seen branch
+ * above.
+ *----------
+ */
+ double n = samplerows;
+ double N = totalrows;
+ double mincount = n*(N-n) / (N-n+0.01*n*(N-1));
- /* Re-extract estimate of # distinct nonnull values in table */
- if (ndistinct_table < 0)
- ndistinct_table = -ndistinct_table * totalrows;
- /* estimate # occurrences in sample of a typical nonnull value */
- avgcount = (double) nonnull_cnt / ndistinct_table;
- /* set minimum threshold count to store a value */
- mincount = avgcount * 1.25;
- if (mincount < 2)
- mincount = 2;
- /* don't let threshold exceed 1/K, however */
- maxmincount = (double) values_cnt / (double) num_bins;
- if (mincount > maxmincount)
- mincount = maxmincount;
if (num_mcv > track_cnt)
num_mcv = track_cnt;
for (i = 0; i < num_mcv; i++)
On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appears more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.
Although I don't think I have enough math background to analyze this
as rigorously as you, I agree with you on this point: the average
frequency doesn't seem very interesting.
One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs. This would be less important if we had reliable
n_distinct estimates, but we don't. So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.
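As a rough sketch of those two bounds (made-up numbers; the clamp to the
least common MCV's frequency is my reading of the current var_eq_const
behavior): if the MCV list covers 98% of a 10M-row table, its least common
entry has a frequency of 0.5%, and we (under)estimate 20 distinct values
with 6 of them stored as MCVs, a non-MCV value is estimated at the smaller
of the two quantities below, so every extra MCV tightens both bounds:
select least(10000000 * 0.005,                    -- never above the least common MCV's frequency
             10000000 * (1 - 0.98) / (20 - 6))    -- residual rows split over residual ndistinct
       as est_rows_for_non_mcv;                   -- = least(50000, ~14286)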
Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).

Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. It does at least appear to no longer be the
case that cranking up the stats target has a weak effect.
Hmm, nice result. I agree with you that it would be nice if we can
come up with a good general-purpose algorithm for this, rather than
making it pluggable. I don't know whether we can succeed in doing
that but we should try hard, because it's always better when things
just work automatically...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/2/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency
I did the same basic eyeball testing I did on earlier patches, and
this is the best implementation so far. I've attached some typical
pg_stats contents for HEAD and this patch. More rigorous testing,
including of planner estimates on real data, is still needed of
course, but I think this is definitely in the right direction.
-John Naylor
Attachments:
On 1 February 2018 at 17:49, Robert Haas <robertmhaas@gmail.com> wrote:
One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs. This would be less important if we had reliable
n_distinct estimates, but we don't. So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.
That's a good point, and a nice explanation. I think that lends more
weight to the argument that we should be including as many MCVs as
possible, provided there's enough evidence to justify their inclusion.
Regards,
Dean
On 4 February 2018 at 12:18, John Naylor <jcnaylor@gmail.com> wrote:
I did the same basic eyeball testing I did on earlier patches, and
this is the best implementation so far. I've attached some typical
pg_stats contents for HEAD and this patch. More rigorous testing,
including of planner estimates on real data, is still needed of
course, but I think this is definitely in the right direction.
Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).
Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV-list is fully populated with
values, most having been seen two (or occasionally three) times. Thus,
if you query for one of those MCVs, the estimate is 6667 or 10,000
rather than 10, which is a pretty bad over-estimate. Increasing the
stats target improves things, because the sample frequencies go down
correspondingly. The problem is also lessened a bit if you randomise
the order of the values in the second column so that it isn't
correlated to the storage order:
insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;
However in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, cf. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.
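That 40-to-50 figure is roughly what the birthday-paradox arithmetic predicts
(a back-of-the-envelope check, assuming 10^7 distinct values each occurring
10 times): any given pair of the 30,000 sampled rows shares a value with
probability 9/99,999,999, so the expected number of values seen twice is about:
SELECT round((30000.0 * 29999 / 2) * (9.0 / 99999999)) AS expected_repeat_values;  -- ~40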
The new MCV algorithm solves this by demanding that a value be seen
many more times before being included in cases where the table size is
much larger than the sample size. The exact threshold depends on the
table size, the sample size and the relative error cutoff value. In
this example, with a table size of 100 million rows, the sample size
makes little difference, because it's always going to be much smaller
than the table size, so the number of instances required in the sample
depends mostly on the RSE cutoff chosen:
rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
10% | 100 | 100 | 97
20% | 25 | 25 | 25
30% | 12 | 12 | 11
40% | 7 | 7 | 7
50% | 4 | 4 | 4
So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit, and
in any case, the theoretical basis for this algorithm breaks down well
before then.
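For reference, the thresholds in that table come straight from the bound in
the patch, cnt > n*(N-n) / (N-n + rse^2*n*(N-1)); for example, with N = 10^8:
-- 10% RSE cutoff (0.01), sample sizes 30,000 and 3,000,000:
SELECT 30000.0 * (1e8 - 30000) / (1e8 - 30000 + 0.01 * 30000 * (1e8 - 1));          -- ~100
SELECT 3000000.0 * (1e8 - 3000000) / (1e8 - 3000000 + 0.01 * 3000000 * (1e8 - 1));  -- ~97
-- 20% RSE cutoff (0.04), sample size 30,000:
SELECT 30000.0 * (1e8 - 30000) / (1e8 - 30000 + 0.04 * 30000 * (1e8 - 1));          -- ~25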
The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values. Running against HEAD and with this patch, for
varying sample sizes and RSE cutoffs gives the following MCV-list
sizes:
 stats_target | mcvs (HEAD) | mcvs (10% RSE) | mcvs (20% RSE) | mcvs (30% RSE)
--------------+-------------+----------------+----------------+----------------
           10 |          10 |              0 |             10 |             10
           20 |          20 |              0 |             20 |             20
           30 |          30 |              0 |             30 |             30
           40 |          40 |             17 |             40 |             40
           50 |          50 |             50 |             50 |             50
          100 |         100 |            100 |            100 |            100
          300 |         132 |            204 |            261 |            290
         1000 |         205 |            266 |            315 |            338
         3000 |         252 |            364 |            400 |            411
        10000 |         296 |            413 |            414 |            412
One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement for a value to be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates based
on those MCVs are pretty decent.
(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)
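In case anyone wants to repeat this, a table along these lines can be
generated with something like the following (a rough sketch, not the exact
script behind the attached results; the table name is arbitrary):
-- ~1 million integers, approximately normal with standard deviation 50
-- (Box-Muller transform; random() returning exactly 0 is theoretically
-- possible but vanishingly unlikely)
CREATE TABLE norm_test AS
  SELECT (50 * sqrt(-2 * ln(random())) * cos(2 * pi() * random()))::int AS v
  FROM generate_series(1, 1000000);
ANALYZE norm_test;
SELECT n_distinct, cardinality(most_common_freqs) AS num_mcvs
  FROM pg_stats WHERE tablename = 'norm_test' AND attname = 'v';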
I've also tried this out against some real-world data, and in the
testing I've done so far, the new algorithm is not actually that
sensitive to the precise RSE cutoff chosen, but I'm leaning towards a
value of 20% because there are cases where 10% is a little too strict.
Regards,
Dean
[1]: /messages/by-id/20170522132017.29944.48391@wrigleys.postgresql.org
Attachments:
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).
Maybe it'd be worth having a separate GUC for this, and a reloption to
override the GUC. It seems to me that it would be very reasonable to
want to separately control (a) how much sampling you want to do and
(b) how aggressively you want to be about including things in the MCV
list. Of course, even if we do that, it doesn't excuse us from
needing to set a good default value. And it might not be necessary to
do the extra work for this.
Looking at your data, it definitely seems like 10% would be too strict
-- if I'm reading this correctly, with a stats target in the 10-50
range, your normally-distributed table gets no MCVs at all, rather
than a number equal to the statistics target. That doesn't seem good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 February 2018 at 13:29, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).

Maybe it'd be worth having a separate GUC for this, and a reloption to
override the GUC. It seems to me that it would be very reasonable to
want to separately control (a) how much sampling you want to do and
(b) how aggressively you want to be about including things in the MCV
list. Of course, even if we do that, it doesn't excuse us from
needing to set a good default value. And it might not be necessary to
do the extra work for this.

Looking at your data, it definitely seems like 10% would be too strict
-- if I'm reading this correctly, with a stats target in the 10-50
range, your normally-distributed table gets no MCVs at all, rather
than a number equal to the statistics target. That doesn't seem good.
(Actually it was the 10-30 stats target range that gave no MCVs at all for 10%)
The fact that 10% leads to no MCVs for a deliberately lowered stats
target of 10 isn't necessarily bad, and might actually have been an
improvement -- e.g., with a stats target of 1, you get no MCVs even
with a 20% RSE cutoff, whereas with HEAD you typically get 1
completely random MCV, and a wildly inaccurate estimate for that
value.
The reason I think 10% is too low is that the extra MCVs you get by
increasing it to 20% generally seem to give better estimates (for a
range of stats targets) than if they weren't included (although it's
sometimes a bit marginal). So 20% seems to strike about the right
balance between too many and too few MCVs.
One thing this new algorithm does do is improve the user's ability to
get more MCVs by increasing the stats target. I'm not yet convinced
there should be a separate knob for the RSE cutoff. For that to be
useful, there would need to be some data distributions for which 10%
(say) was clearly better, and some for which 20% was better. So far,
there doesn't appear to be a massive difference between the two, and
it's nothing that can't compensated for using the existing stats
target knob.
Regards,
Dean
On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
One thing this new algorithm does do is improve the user's ability to
get more MCVs by increasing the stats target. I'm not yet convinced
there should be a separate knob for the RSE cutoff. For that to be
useful, there would need to be some data distributions for which 10%
(say) was clearly better, and some for which 20% was better. So far,
there doesn't appear to be a massive difference between the two, and
it's nothing that can't be compensated for using the existing stats
target knob.
Fair enough. Do you plan to press forward with this, then, or what's
the next step?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 February 2018 at 15:25, Robert Haas <robertmhaas@gmail.com> wrote:
Do you plan to press forward with this, then, or what's
the next step?
Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the 2 original
complaints, and I've not yet seen a case where it makes things worse.
I plan to do more testing (and if anyone else wants to, that would be
most welcome), and then barring any unexpected issues/objections, I'll
commit it. Probably not this week though.
Regards,
Dean
Hi Dean,
On 2018-02-07 15:58:14 +0000, Dean Rasheed wrote:
On 7 February 2018 at 15:25, Robert Haas <robertmhaas@gmail.com> wrote:
Do you plan to press forward with this, then, or what's
the next step?

Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the 2 original
complaints, and I've not yet seen a case where it makes things worse.

I plan to do more testing (and if anyone else wants to, that would be
most welcome), and then barring any unexpected issues/objections, I'll
commit it. Probably not this week though.
This sounds like the patch's status of "waiting on author" isn't right,
and it should rather be "ready for committer"?
Greetings,
Andres Freund
On 1 March 2018 at 21:01, Andres Freund <andres@anarazel.de> wrote:
This sounds like the patch's status of "waiting on author" isn't right,
and it should more be ready for committer?
Yes, I'll take a look at it this weekend.
Regards,
Dean
On 7 February 2018 at 15:58, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On 7 February 2018 at 15:25, Robert Haas <robertmhaas@gmail.com> wrote:
Do you plan to press forward with this, then, or what's
the next step?

I plan to do more testing
TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%, 15%, ... 50%) against a variety of different data
distributions, and it looks like a definite improvement over the
current algorithm, for a wide range of cutoff values. Overall, it
isn't that sensitive to the exact cutoff chosen, but it looks like a
value of 20% is a good general-purpose choice.
One of the properties of the new algorithm is that the size of the MCV
list responds better to changing the stats target, so I don't think
it's worth having a separate knob to control the cutoff percentage.
Such a knob would have a similar, but weaker effect than the stats
target knob, and thus doesn't seem worthwhile.
Attached is an updated patch. I decided to move the code that
determines the minimum number of occurrences for an MCV list item out
to a separate function. It's not much code, but there's a large
comment to explain its statistical justification, and I didn't want to
duplicate that in 2 different places (possibly to become 3, with the
multivariate MCV stats patch).
I think this is ready for commit now, so barring objections, I will do
so in the next day or so.
---
I tested using the attached python script, which tests a large number
of different data distributions, comparing the results with the patch
vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates
against actual row counts, both for values in the MCV list, and
non-MCV values, making it possible to tell whether it would have been
better if the MCV list had been bigger or smaller.
There's a lot of random variation between tests, but by running a lot
of them, the general pattern can be seen. I ran the test 9 times, with
different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
then collected all the results from the test runs into a single CSV
file to analyse using SQL. The columns in the CSV are:
test_name - A textual description of the data distribution used in the test.
mcv_cutoff - The relative standard error cutoff percentage used (10,
15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
against HEAD.
stats_tgt - The stats target used (100 or 1000).
num_mcvs - The size of the resulting MCV list.
avg_mcv_err - The average percentage difference between estimated
and actual row counts for values in the MCV list. (There's a bit of a
fudge factor in calculating these percentages, to avoid them becoming
too large in cases where the row counts are small, but I think it's
still useful for comparison purposes.)
max_mcv_err - The maximum percentage difference between estimated
and actual row counts for values in the MCV list.
avg_non_mcv_err - The average percentage difference between
estimated and actual row counts for a bunch of non-MCV values.
max_non_mcv_err - The maximum percentage difference between
estimated and actual row counts for a bunch of non-MCV values.
avg_err - The average percentage difference between estimated and
actual row counts for all values tested.
max_err - The maximum percentage difference between estimated and
actual row counts for all values tested.
From this, it's possible to look for cases where the MCV list is too
large or too small. For example, looking for too-many-mcv cases
(generally the more uniform data distributions):
SELECT mcv_cutoff, count(*)
FROM results
WHERE max_mcv_err - max_non_mcv_err > 50
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | count
------------+-------
-40 | 41
-50 | 40
-25 | 39
-15 | 38
-20 | 38
-30 | 38
-35 | 38
-45 | 37
-10 | 37
50 | 35
40 | 35
45 | 34
35 | 33
30 | 33
25 | 25
20 | 13
15 | 6
10 | 3
(18 rows)
SELECT mcv_cutoff, count(*)
FROM results
WHERE max_mcv_err - max_non_mcv_err > 100
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | count
------------+-------
-30 | 17
-40 | 16
-15 | 15
-25 | 14
-10 | 14
-35 | 14
-45 | 13
-50 | 13
-20 | 12
35 | 12
45 | 11
40 | 10
30 | 10
50 | 10
25 | 6
10 | 2
(16 rows)
( and implicitly:
15 | 0
20 | 0 )
So the patched code is generally better at avoiding the too-many-mcvs
problem, but an mcv_cutoff of 30% or more may be too high.
Looking for too-few-mcv cases:
SELECT mcv_cutoff, count(*)
FROM results
WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | count
------------+-------
-20 | 120
-50 | 116
-30 | 115
-35 | 114
-25 | 114
-10 | 114
-45 | 113
10 | 113
-40 | 113
-15 | 111
15 | 98
20 | 49
25 | 44
40 | 39
30 | 38
45 | 37
35 | 37
50 | 35
(18 rows)
SELECT mcv_cutoff, count(*)
FROM results
WHERE (max_non_mcv_err - max_mcv_err > 100 AND num_mcvs < stats_tgt)
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | count
------------+-------
-20 | 119
-50 | 116
-30 | 115
-40 | 113
-45 | 112
-25 | 112
-10 | 112
-35 | 112
-15 | 111
10 | 106
15 | 51
20 | 45
25 | 43
30 | 38
40 | 37
35 | 36
45 | 34
50 | 31
(18 rows)
and looking specifically to see to what extent the problem persists
when the stats target is increased to 1000:
SELECT mcv_cutoff, count(*)
FROM results
WHERE max_non_mcv_err - max_mcv_err > 100
AND stats_tgt = 1000
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | count
------------+-------
-20 | 69
-30 | 67
-50 | 67
-35 | 66
-40 | 66
-10 | 66
-15 | 65
-25 | 64
-45 | 63
10 | 60
30 | 8
20 | 8
45 | 8
15 | 8
35 | 7
50 | 7
25 | 7
40 | 6
(18 rows)
So again, the patched code is generally better at avoiding the
too-few-mcvs problem, but an mcv_cutoff of 10% is almost certainly too
low.
Looking at the effect of changing the stats target from 100 to 1000 in
the tests:
SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
FROM results r1, results r2
WHERE r2.test_name = r1.test_name
AND r2.mcv_cutoff = r1.mcv_cutoff
AND r1.stats_tgt = 100
AND r2.stats_tgt = 1000
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | sum
------------+-------
15 | 88582
20 | 87124
35 | 86828
10 | 86749
45 | 86632
40 | 86552
50 | 86384
25 | 85567
30 | 85352
-25 | 28596
-15 | 28567
-35 | 28559
-40 | 28517
-20 | 28509
-30 | 28483
-45 | 28464
-50 | 28452
-10 | 28411
(18 rows)
it's clear that the patched code is much better at responding to
increasing the stats target and producing more MCVs.
Interestingly, I noticed while eyeballing the raw test output that
with HEAD there are quite a few instances where increasing the stats
target actually decreases the size of the MCV list, and also these are
often in cases where the data is more non-uniformly distributed, and
the MCV list is too small to start with:
SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
FROM results r1, results r2
WHERE r2.test_name = r1.test_name
AND r2.mcv_cutoff = r1.mcv_cutoff
AND r1.stats_tgt = 100
AND r2.stats_tgt = 1000
AND r2.num_mcvs < r1.num_mcvs
GROUP BY 1
ORDER BY 2 DESC;
mcv_cutoff | sum
------------+-------
15 | -4
10 | -11
-35 | -5475
-20 | -5493
-50 | -5507
-40 | -5510
-45 | -5510
-30 | -5511
-25 | -5514
-10 | -5528
-15 | -5532
(11 rows)
I could probably keep going, querying the test result data in all
sorts of other ways (and it's attached, should anyone wish to do so,
although bear in mind that it's subject to quite large random
variations), but my general conclusions are:
- The patched code is better than HEAD at avoiding the too-many-mcvs
problem, unless the mcv_cutoff is set too high (maybe around 30% or
more).
- The patched code is much better at avoiding the too-few-mcvs
problem, unless the mcv_cutoff is set too low (10% seems to be too
low).
- The patched code is much better at producing more MCVs as the stats
target is increased, making it better able to handle more non-uniform
distributions.
- An mcv_cutoff of 20% looks like a good general-purpose value.
Regards,
Dean
Attachments:
mcv-tests.tar.bz2 (application/x-bzip)