MCV lists for highly skewed distributions

Started by Jeff Janes · over 8 years ago · 40 messages · hackers
#1Jeff Janes
jeff.janes@gmail.com

I want to revive a patch I sent a couple of years ago to the performance list,
as I have seen the issue pop up repeatedly since then.

Original thread is here:
/messages/by-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ=_ME1aaf4vOmnnA@mail.gmail.com

The problem is that if you have a highly skewed distribution, such as
exponential frequencies, it usually stores too few entries in the MCV list.

For example:

create table foo as select floor(-log(random())/log(2))::int as who from
generate_series(1,10000000);

This will usually store 0 to 5 as MCV with the default stat target[1].
Then value 6 is not stored, and its ~75k repetitions get averaged over the
remaining distinct values. This leads to horrible misplanning for rare
values. For example, who=17 has 37 entries, but is estimated to have
20,000. The correct join for 37 values is often not the correct one for
20,000.
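
To see the scale of the misestimate, here is a back-of-the-envelope model of that example (names and the 24-value cutoff are invented for illustration, and a perfect ndistinct estimate is assumed; the real planner's lower ndistinct estimate inflates the number further, to the ~20,000 observed):

```python
# Value k = floor(-log2(random())) occurs roughly N / 2^(k+1) times.
N = 10_000_000
true_counts = {k: N // 2 ** (k + 1) for k in range(24)}

# Suppose ANALYZE keeps only values 0..5 as MCVs.
mcv_rows = sum(true_counts[k] for k in range(6))
residual_rows = N - mcv_rows                 # rows not covered by the MCV list
residual_ndistinct = len(true_counts) - 6    # assume a perfect ndistinct estimate

# The residual rows get averaged flat over the remaining distinct values:
flat_estimate = residual_rows / residual_ndistinct
print(true_counts[17], round(flat_estimate))  # ~38 true rows vs ~8681 estimated
```

Even under these generous assumptions the flat average is two orders of magnitude above the true count for who=17.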

If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.
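
In analyze.c terms the change is small, but a toy model makes the effect visible. The function below is a simplified sketch of the selection loop (invented names, not the patch itself), with the current static threshold included for comparison, run against roughly exponential sample counts like those the example above would produce:

```python
def select_mcvs(track_counts, values_cnt, ndistinct, dynamic=True):
    """Toy model of the MCV cutoff (not the actual analyze.c code).

    With dynamic=False, each candidate is compared against 1.25x the
    average frequency over *all* values (current behaviour); with
    dynamic=True, against the average over only the values not yet
    taken as MCVs (the proposed behaviour)."""
    mcvs = []
    remaining_cnt, remaining_nd = values_cnt, ndistinct
    for count in sorted(track_counts, reverse=True):
        if remaining_nd <= 0:
            break
        avg = (remaining_cnt / remaining_nd) if dynamic else (values_cnt / ndistinct)
        if count <= 1.25 * avg:   # not significantly more common than the rest
            break
        mcvs.append(count)
        remaining_cnt -= count    # deplete the residual as values are taken
        remaining_nd -= 1
    return mcvs

# Roughly exponential per-value counts in a 30,000-row sample (illustrative):
track = [15000, 7500, 3750, 1875, 937, 468, 234, 117, 58, 29, 14, 7, 3, 1, 1]
print(len(select_mcvs(track, sum(track), len(track), dynamic=False)))  # 3
print(len(select_mcvs(track, sum(track), len(track), dynamic=True)))   # 13
```

The static threshold stops after three values; the dynamic one keeps going as long as each candidate stands out against what remains.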

As it is, you can partially overcome the too short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a
high cost of CPU time. In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.

I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

[1]: Occasionally it will store a much longer MCV list, because no value was
sampled exactly once, which triggers a different code path in which all seen
values are put in the MCV list and the histogram is NULL. This is not
reliable, as whether the least-sampled value is present in the sample once
or twice is pretty brittle.

Cheers,

Jeff

Attachments:

analyze_highly_skewed.patch (text/x-patch; +19/-15)
#2John Naylor
john.naylor@enterprisedb.com
In reply to: Jeff Janes (#1)
Re: MCV lists for highly skewed distributions

On 12/28/17, Jeff Janes <jeff.janes@gmail.com> wrote:

I want to revive a patch I sent a couple of years ago to the performance list,
as I have seen the issue pop up repeatedly since then.

If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.

After reading the thread you mentioned, I think that's a good approach.

On master, the numerator of avg_count is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?

Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.

This part doesn't use any loop variables, so should happen before the loop:

+ if (num_mcv > track_cnt)
+ num_mcv = track_cnt;

I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avg_count.

I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.

-John Naylor

#3John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#2)
Re: MCV lists for highly skewed distributions

I wrote:

On 12/28/17, Jeff Janes <jeff.janes@gmail.com> wrote:

I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don't get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.

I did some quick and dirty testing of this, and I just want to note
that in this case, setting mincount to its hard-coded minimum must
come after setting it to maxmincount, since now maxmincount can go
arbitrarily low.

I'll be travelling for a few days, but I'll do some testing on some
data sets soon. While looking through the archives for more info, I
saw this thread

/messages/by-id/32261.1496611829@sss.pgh.pa.us

which showcases the opposite problem: For more uniform distributions,
there are too many MCVs. Not relevant to your problem, but if I have
time I'll try my hand at testing an approach suggested in that thread
at the same time I test your patch and see how it interacts.

-John Naylor

#4John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#3)
Re: MCV lists for highly skewed distributions

I wrote:

I'll be travelling for a few days, but I'll do some testing on some
data sets soon.

I've attached some preliminary test methods and results. Probably not
detailed or rigorous enough, but I wanted to share this much before
digging deeper. I created some test data, ran analyze a few times and
eyeballed the results, of which I give a typical example. I used a low
stat target to make it easier to see the general picture. Suggestions
welcome.

-John Naylor

Attachments:

test_analyze_highly_skewed_v1.sql (application/sql)
#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#1)
Re: MCV lists for highly skewed distributions

On 28 December 2017 at 01:45, Jeff Janes <jeff.janes@gmail.com> wrote:

If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

I agree with this thought.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.

It looks like this might even have been the original intention of that code.

Patch looks OK, but I think the comments need minor changes above line 2575

As it is, you can partially overcome the too short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a high
cost of CPU time. In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.

Agreed, not a good solution.

[1] Occasionally it will store a much longer MCV list, because no value was
sampled exactly once, which triggers a different code path in which all seen
values are put in the MCV list and the histogram is NULL. This is not reliable,
as whether the least-sampled value is present in the sample once or twice is
pretty brittle.

And we need a better discussion of risk: before, we generated too few
MCV entries. To what extent might we now generate too many? That would
be a problem due to increased planning time.

I have a slight reservation about whether 1.25x is still a sensible
heuristic. Should we add a parameter for that to allow testing during
beta?

Marking as Ready For Committer.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6John Naylor
john.naylor@enterprisedb.com
In reply to: Simon Riggs (#5)
Re: MCV lists for highly skewed distributions

[1] Occasionally it will store a much longer MCV list, because no value was
sampled exactly once, which triggers a different code path in which all seen
values are put in the MCV list and the histogram is NULL. This is not
reliable, as whether the least-sampled value is present in the sample once
or twice is pretty brittle.

And we need a better discussion of risk: before, we generated too few
MCV entries. To what extent might we now generate too many? That would
be a problem due to increased planning time.

(My apologies, I just now found some time to test this further, but I
don't have additional results yet)

Simon,
Earlier, I referenced a thread [1] complaining that we currently
already have too many MCVs in the case of uniform distributions, with
worse consequences than planning time. Based on my (admittedly quick
and dirty) preliminary testing (see attachment from a couple weeks
ago), this patch exacerbates that problem, and I was hoping to find a
way to fix that.

I have a slight reservation about whether 1.25x is still a sensible
heuristic.

This was also discussed in [1], but no patch came out of it. I was
just now turning the formulas discussed there into code, but I'll
defer to someone with more expertise. FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.

[1]: /messages/by-id/32261.1496611829@sss.pgh.pa.us

-John Naylor

#7John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#6)
Re: MCV lists for highly skewed distributions

I wrote:

FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.

By this I meant the coefficient of variation of the class size in the
sample, as denoted by gamma in the Haas and Stokes paper on page 7.

-John Naylor

#8John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#6)
Re: MCV lists for highly skewed distributions

I wrote:

I have a slight reservation about whether 1.25x is still a sensible
heuristic.

This was also discussed in [1], but no patch came out of it. I was
just now turning the formulas discussed there into code, but I'll
defer to someone with more expertise. FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.

I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.

I have not been able to come up with a more compelling alternative, so
I have nothing further to say about the patch under review.

[1]
/messages/by-id/32261.1496611829@sss.pgh.pa.us

-John Naylor

#9Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#8)
Re: MCV lists for highly skewed distributions

On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote:

I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.

Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of
distinct values is often too low, so that might explain what you're
seeing.

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.
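
For concreteness, the relative standard error of a sample frequency can be computed with the standard formula for a proportion under sampling without replacement (a sketch of the idea only, with an invented function name; any eventual patch may differ in detail):

```python
import math

def relative_std_error(count, samplerows, totalrows):
    """Relative standard error of a value's sample frequency, using the
    textbook formula for a sample proportion with finite-population
    correction."""
    p = count / samplerows
    variance = (p * (1 - p) / samplerows
                * (totalrows - samplerows) / (totalrows - 1))
    return math.sqrt(variance) / p

# A value seen only twice in a 30,000-row sample of a 100-million-row table
# has ~70% relative error -- far too uncertain to extrapolate from:
print(round(relative_std_error(2, 30_000, 100_000_000), 2))  # 0.71
```

For rare values this is approximately 1/sqrt(count), so bounding the relative error amounts to demanding a minimum number of sightings in the sample.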

Regards,
Dean

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#9)
Re: MCV lists for highly skewed distributions

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

This patch is marked Ready for Committer, but that seems wildly optimistic
based on the state of the discussion. It doesn't look to me like we
even have consensus on an algorithm, let alone code for it. Certainly,
whatever we do needs to address the too-many-MCVs issue from [1] as
well as Jeff's too-few-MCVs case.

Even if we had a plausible patch, I'm not sure how we get to the
point of having enough consensus to commit. In the previous thread,
it seemed that some people would object to any change whatsoever :-(

In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.

regards, tom lane

[1]: /messages/by-id/32261.1496611829@sss.pgh.pa.us

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#10)
Re: MCV lists for highly skewed distributions

On 25 January 2018 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

This patch is marked Ready for Committer, but that seems wildly optimistic
based on the state of the discussion. It doesn't look to me like we
even have consensus on an algorithm, let alone code for it. Certainly,
whatever we do needs to address the too-many-MCVs issue from [1] as
well as Jeff's too-few-MCVs case.

Even if we had a plausible patch, I'm not sure how we get to the
point of having enough consensus to commit. In the previous thread,
it seemed that some people would object to any change whatsoever :-(

In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.

Dean and John's results show that different algorithms work better for
different cases.

How about we make ANALYZE's MCV algorithm pluggable? And then include,
say, 2 additional algorithms.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Simon Riggs (#11)
Re: MCV lists for highly skewed distributions

On 1 February 2018 at 13:16, Simon Riggs <simon@2ndquadrant.com> wrote:

On 25 January 2018 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.

Dean and John's results show that different algorithms work better for
different cases.

How about we make ANALYZE's MCV algorithm pluggable? And then include,
say, 2 additional algorithms.

I don't think we've yet proved that that's needed. I'd rather attempt
to come up with a decent general-purpose algorithm, if possible.

The main problem I have with the originally proposed patch is the lack
of any kind of rigorous justification for the approach of changing the
algorithm to include values that are "significantly more common than
average frequency for values which will not be represented in the MCV
list". So there's no guarantee that the MCVs produced will be backed
by sufficient evidence, and it risks making the too-many-MCVs case
worse.

Of course the current code suffers from both the too-many-MCVs and
too-few-MCVs problems, depending on the data distribution:

For a reasonably uniform distribution with quite a large number of
distinct values, it is possible to generate MCVs on the basis of
having seen only a few instances of the values. Those few values seen
are then not sufficiently statistically significant, and extrapolating
to the whole table produces wildly inaccurate estimates.

For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appears more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.

There is also another variant of the too-many-MCVs problem that I
believe is also possible -- if the sample contains a large number of
NULLs or too-wide values, values_cnt could be reduced to the point
where maxmincount is quite small, and again MCVs could get included on
the basis of a very small number of occurrences.

I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency -- so we
include values in the MCV list if and only if they are seen enough
times in the sample that the standard error on the sample frequency is
small compared to the sample frequency itself, and thus we expect that
the relative error resulting from extrapolating that to the whole
table is also small. In theory then, it should never generate too many
MCVs, although tuning of the relative error threshold might be
required to prevent it from generating too few.

Note, this is not a finished patch (e.g., I haven't touched
compute_distinct_stats()). It's merely a possible starting point from
which a lot more testing will be required.

Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).

Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. It does at least appear to no longer be the
case that cranking up the stats target has a weak effect.

Regards,
Dean

[1]: /messages/by-id/20170522132017.29944.48391@wrigleys.postgresql.org

Attachments:

mcv-stats.patch (text/x-patch; +41/-30)
#13Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#12)
Re: MCV lists for highly skewed distributions

On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appears more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.

Although I don't think I have enough math background to analyze this
as rigorously as you, I agree with you on this point: the average
frequency doesn't seem very interesting.

One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs. This would be less important if we had reliable
n_distinct estimates, but we don't. So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.

Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).

Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. It does at least appear to no longer be the
case that cranking up the stats target has a weak effect.

Hmm, nice result. I agree with you that it would be nice if we can
come up with a good general-purpose algorithm for this, rather than
making it pluggable. I don't know whether we can succeed in doing
that but we should try hard, because it's always better when things
just work automatically...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14John Naylor
john.naylor@enterprisedb.com
In reply to: Dean Rasheed (#12)
Re: MCV lists for highly skewed distributions

On 2/2/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency

I did the same basic eyeball testing I did on earlier patches, and
this is the best implementation so far. I've attached some typical
pg_stats contents for HEAD and this patch. More rigorous testing,
including of planner estimates on real data, is still needed of
course, but I think this is definitely in the right direction.

-John Naylor

Attachments:

test_mcvstats_v1.sql (application/sql)
#15Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#13)
Re: MCV lists for highly skewed distributions

On 1 February 2018 at 17:49, Robert Haas <robertmhaas@gmail.com> wrote:

One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
not consumed by MCVs. This would be less important if we had reliable
n_distinct estimates, but we don't. So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.

That's a good point, and a nice explanation. I think that lends more
weight to the argument that we should be including as many MCVs as
possible, provided there's enough evidence to justify their inclusion.

Regards,
Dean

#16Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#14)
Re: MCV lists for highly skewed distributions

On 4 February 2018 at 12:18, John Naylor <jcnaylor@gmail.com> wrote:

I did the same basic eyeball testing I did on earlier patches, and
this is the best implementation so far. I've attached some typical
pg_stats contents for HEAD and this patch. More rigorous testing,
including of planner estimates on real data, is still needed of
course, but I think this is definitely in the right direction.

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09/).

Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV-list is fully populated with
values, most having been seen two (or occasionally three) times. Thus,
if you query for one of those MCVs, the estimate is 6667 or 10,000
rather than 10, which is a pretty bad over-estimate. Increasing the
stats target improves things, because the sample frequencies go down
correspondingly. The problem is also lessened a bit if you randomise
the order of the values in the second column so that it isn't
correlated to the storage order:

insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, cf. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.
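
That "40 to 50" ballpark can be sanity-checked with a simple binomial model (an approximation: it treats the sample as independent per-row draws rather than sampling without replacement):

```python
# Each of the ~10 million distinct values occurs 10 times in the table,
# and each of those 10 rows lands in a 30,000-row sample with p = n/N.
n, N, dups, ndistinct = 30_000, 100_000_000, 10, 10_000_000
p = n / N

# Probability a given value is sampled at least twice:
p_two_or_more = 1 - (1 - p) ** dups - dups * p * (1 - p) ** (dups - 1)
expected_mcvs = ndistinct * p_two_or_more
print(round(expected_mcvs))  # ~40, consistent with the observation above
```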

The new MCV algorithm solves this by demanding that a value be seen
many more times before being included in cases where the table size is
much larger than the sample size. The exact threshold depends on the
table size, the sample size and the relative error cutoff value. In
this example, with a table size of 100 million rows, the sample size
makes little difference, because it's always going to be much smaller
than the table size, so the number of instances required in the sample
depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4

So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit, and
in any case, the theoretical basis for this algorithm breaks down well
before then.
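
The counts in the table above follow directly from the standard error of a sample proportion with finite-population correction; a small sketch to reproduce them (illustrative code, not the patch's actual arithmetic):

```python
import math

def min_mcv_count(samplerows, totalrows, rse_cutoff):
    """Smallest number of sample occurrences for which the relative
    standard error of the sample frequency drops below the cutoff."""
    fpc = (totalrows - samplerows) / (totalrows - 1)
    count = 1
    while True:
        p = count / samplerows
        rse = math.sqrt((1 - p) / count * fpc)  # = sqrt(p*(1-p)/n * fpc) / p
        if rse < rse_cutoff:
            return count
        count += 1

for cutoff in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(cutoff, [min_mcv_count(n, 100_000_000, cutoff)
                   for n in (30_000, 300_000, 3_000_000)])
# 0.1 [100, 100, 97]
# 0.2 [25, 25, 25]
# 0.3 [12, 12, 11]
# 0.4 [7, 7, 7]
# 0.5 [4, 4, 4]
```

Since the RSE is roughly 1/sqrt(count) when the sample is much smaller than the table, the threshold is approximately 1/cutoff^2, which is why the sample size barely matters here.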

The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values. Running against HEAD and with this patch, for
varying sample sizes and RSE cutoffs gives the following MCV-list
sizes:

stats_target  mcvs (HEAD)  mcvs (10% RSE)  mcvs (20% RSE)  mcvs (30% RSE)
          10           10               0              10              10
          20           20               0              20              20
          30           30               0              30              30
          40           40              17              40              40
          50           50              50              50              50
         100          100             100             100             100
         300          132             204             261             290
        1000          205             266             315             338
        3000          252             364             400             411
       10000          296             413             414             412

One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement for a value to be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates based
on those MCVs are pretty decent.

(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)

I've also tried this out against some real-world data, and in the
testing I've done so far, the new algorithm is not actually that
sensitive to the precise RSE cutoff chosen, but I'm leaning towards a
value of 20% because there are cases where 10% is a little too strict.

Regards,
Dean

[1]: /messages/by-id/20170522132017.29944.48391@wrigleys.postgresql.org

Attachments:

test_normal_0_50.sql (application/octet-stream)
#17Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#16)
Re: MCV lists for highly skewed distributions

On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09/).

Maybe it'd be worth having a separate GUC for this, and a reloption to
override the GUC. It seems to me that it would be very reasonable to
want to separately control (a) how much sampling you want to do and
(b) how aggressively you want to be about including things in the MCV
list. Of course, even if we do that, it doesn't excuse us from
needing to set a good default value. And it might not be necessary to
do the extra work for this.

Looking at your data, it definitely seems like 10% would be too strict
-- if I'm reading this correctly, with a stats target in the 10-50
range, your normally-distributed table gets no MCVs at all, rather
than a number equal to the statistics target. That doesn't seem good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#17)
Re: MCV lists for highly skewed distributions

On 7 February 2018 at 13:29, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
>> Thanks for testing. I agree, this new algorithm seems to stand up
>> pretty well in the testing I've done too. One thing about it that can
>> be tuned is the cutoff threshold for the relative standard error -- I
>> chose 10% completely arbitrarily, but it could just as easily be 20%,
>> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).
>
> Maybe it'd be worth having a separate GUC for this, and a reloption to
> override the GUC. It seems to me that it would be very reasonable to
> want to separately control (a) how much sampling you want to do and
> (b) how aggressively you want to be about including things in the MCV
> list. Of course, even if we do that, it doesn't excuse us from
> needing to set a good default value. And it might not be necessary to
> do the extra work for this.
>
> Looking at your data, it definitely seems like 10% would be too strict
> -- if I'm reading this correctly, with a stats target in the 10-50
> range, your normally-distributed table gets no MCVs at all, rather
> than a number equal to the statistics target. That doesn't seem good.

(Actually it was the 10-30 stats target range that gave no MCVs at all for 10%)

The fact that 10% leads to no MCVs for a deliberately lowered stats
target of 10 isn't necessarily bad, and might actually have been an
improvement -- e.g., with a stats target of 1, you get no MCVs even
with a 20% RSE cutoff, whereas with HEAD you typically get 1
completely random MCV, and a wildly inaccurate estimate for that
value.

The reason I think 10% is too low is that the extra MCVs you get by
increasing it to 20% generally seem to give better estimates (for a
range of stats targets) than if they weren't included (although it's
sometimes a bit marginal). So 20% seems to strike about the right
balance between too many and too few MCVs.

One thing this new algorithm does do is improve the user's ability to
get more MCVs by increasing the stats target. I'm not yet convinced
there should be a separate knob for the RSE cutoff. For that to be
useful, there would need to be some data distributions for which 10%
(say) was clearly better, and some for which 20% was better. So far,
there doesn't appear to be a massive difference between the two, and
it's nothing that can't be compensated for using the existing stats
target knob.
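[Editor's note: the RSE-based inclusion test being discussed can be sketched as follows. This is a minimal illustration, not the patch itself; the function names and the exact variance formula (hypergeometric, i.e. sampling without replacement) are assumptions for the purpose of the example. For a value seen d times in a sample of n rows from a table of N rows, the total count is estimated as K = N*d/n, and the value is kept in the MCV list only if the relative standard error of that estimate is below the cutoff.]

```python
import math

def mcv_rse(dups_cnt, samplerows, totalrows):
    """Relative standard error of the population-count estimate
    K = N * d / n for a value seen d times in a sample of n rows
    drawn without replacement from a table of N rows."""
    d, n, N = float(dups_cnt), float(samplerows), float(totalrows)
    K = N * d / n                      # estimated occurrences in the table
    p = K / N                          # estimated fraction of the table
    # hypergeometric variance of the count estimate, scaled up to the table
    var_K = (N * N / n) * p * (1.0 - p) * (N - n) / (N - 1.0)
    return math.sqrt(var_K) / K

def keep_in_mcv(dups_cnt, samplerows, totalrows, rse_cutoff=0.20):
    """Include a value in the MCV list iff its count estimate is
    reliable enough (RSE below the cutoff, e.g. 20%)."""
    return mcv_rse(dups_cnt, samplerows, totalrows) < rse_cutoff
```

When n << N and the value's fraction p is small, the RSE is approximately 1/sqrt(d), so a 20% cutoff admits values seen roughly 25 or more times in the sample; this is why raising the stats target (and hence the sample size) directly yields more MCVs under this scheme.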

Regards,
Dean

#19Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#18)
Re: MCV lists for highly skewed distributions

On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> One thing this new algorithm does do is improve the user's ability to
> get more MCVs by increasing the stats target. I'm not yet convinced
> there should be a separate knob for the RSE cutoff. For that to be
> useful, there would need to be some data distributions for which 10%
> (say) was clearly better, and some for which 20% was better. So far,
> there doesn't appear to be a massive difference between the two, and
> it's nothing that can't be compensated for using the existing stats
> target knob.

Fair enough. Do you plan to press forward with this, then, or what's
the next step?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#19)
Re: MCV lists for highly skewed distributions

On 7 February 2018 at 15:25, Robert Haas <robertmhaas@gmail.com> wrote:

> Do you plan to press forward with this, then, or what's
> the next step?

Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the two original
complaints, and I've not yet seen a case where it makes things worse.

I plan to do more testing (and if anyone else wants to, that would be
most welcome), and then barring any unexpected issues/objections, I'll
commit it. Probably not this week though.

Regards,
Dean

#21Andres Freund
andres@anarazel.de
In reply to: Dean Rasheed (#20)
#22Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Andres Freund (#21)
#23Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#20)
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Dean Rasheed (#23)
#25Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#24)
#26Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#25)
#27John Naylor
john.naylor@enterprisedb.com
In reply to: Dean Rasheed (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#26)
#29Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#28)
#30Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#27)
#31Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#30)
#32Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#28)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#32)
#34Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#34)
#36John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#27)
#37Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#36)
#38Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#37)
#39John Naylor
john.naylor@enterprisedb.com
In reply to: Dean Rasheed (#38)
#40Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#39)