stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

Started by John Naylor almost 8 years ago, 2 messages
#1 John Naylor
jcnaylor@gmail.com
1 attachment

(Starting a new thread so as not to distract review)

On 1/21/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote:

I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.

Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of
distinct values is often too low, so that might explain what you're
seeing.

In my test tables, I've noticed that our Ndistinct estimator is most
inaccurate for geometric distributions, so that's certainly possible,
but confusingly, it occasionally gave an empty MCV list along with a
histogram with a boundary duplicated 5 times, which I thought I was
guarding against. I'm thinking my implementation of your logic is
flawed somehow. In case you're curious, I've attached my rough
(compiler warnings and all) test patch.

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

If you don't mind, what would the math look like for that?

-John Naylor

Attachments:

SE_test_for_MCV_list.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5f21fcb..da21333 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2318,6 +2318,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			num_mcv = stats->attr->attstattarget;
 	int			num_bins = stats->attr->attstattarget;
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+	double		N,
+				n;
 
 	values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
 	tupnoLink = (int *) palloc(samplerows * sizeof(int));
@@ -2525,10 +2527,10 @@ compute_scalar_stats(VacAttrStatsP stats,
 			 */
 			int			f1 = ndistinct - nmultiple + toowide_cnt;
 			int			d = f1 + nmultiple;
-			double		n = samplerows - null_cnt;
-			double		N = totalrows * (1.0 - stats->stanullfrac);
 			double		stadistinct;
 
+			n = samplerows - null_cnt;
+			N = totalrows * (1.0 - stats->stanullfrac);
 			/* N == 0 shouldn't happen, but just in case ... */
 			if (N > 0)
 				stadistinct = (n * d) / ((n - f1) + f1 * n / N);
@@ -2558,9 +2560,44 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
 		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
+		 * significantly more common than the (estimated) average.
+		 *
+		 * Note: For this POC patch, the implementation and comments
+		 * were copied from an email from Dean Rasheed, which contains further references:
+		 * https://www.postgresql.org/message-id/CAEZATCVu9zK0N%3Dnd9ufavabbM8YZiyWYJca0oiE8F31GAY%2B_XA%40mail.gmail.com
+		 *
+		 * We calculate the threshold from the table and sample sizes.
+
+		 * The initial rule of thumb is that the value should occur at
+		 * least 10 times in the sample.
+		 *
+		 * Suppose that N is the population size (total number of rows in the
+		 * table), and n is the sample size, and that some particular candidate
+		 * value appears x times in the sample. Then the "sample proportion" is
+		 * given by p = x/n.
+		 *
+		 * It is reasonable to treat p as having a normal distribution, which
+		 * then allows the margin of error to be analysed using standard
+		 * techniques. We calculate the standard error of the sample proportion:
+		 *
+		 * SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+		 *
+		 * The second term is a finite population correction. There is a 95%
+		 * probability that the total population proportion lies in the range
+		 *
+		 * [ pmin = p-2*SE, pmax = p+2*SE ]
+		 *
+		 * If there are Nd distinct values in the table, so that the average
+		 * frequency of occurrence of any particular value is 1/Nd, then the test
+		 *
+		 * pmin > 1/Nd
+		 *
+		 * would imply that there is a 97.5% probability that the candidate value is
+		 * more common than the average (since the 5% of the distribution of p
+		 * outside the confidence interval is split evenly below pmin and above
+		 * pmax).
+		 *
+		 * Also, we won't suppress values
 		 * that have a frequency of at least 1/K where K is the intended
 		 * number of histogram bins; such values might otherwise cause us to
 		 * emit duplicate histogram bin boundaries.  (We might end up with
@@ -2587,28 +2624,36 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 		{
 			double		ndistinct_table = stats->stadistinct;
-			double		avgcount,
-						mincount,
-						maxmincount;
+			double		avg_freq,
+						hist_width,
+						p,
+						pmin,
+						SE;
 
 			/* Re-extract estimate of # distinct nonnull values in table */
 			if (ndistinct_table < 0)
 				ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-				mincount = maxmincount;
+			/* Compute average frequency */
+			avg_freq = (double) 1 / (double) ndistinct_table;
+			/*
+			 * Compute 1/K. If a value has a higher frequency and is not
+			 * included in the MCV list, we might emit duplicate
+			 * histogram bin boundaries.
+			 */
+			hist_width = (double) 1 / (double) num_bins;
+
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
+			/* Test values for minimum threshold frequency */
 			for (i = 0; i < num_mcv; i++)
 			{
-				if (track[i].count < mincount)
+				p = (double) track[i].count / (double) n;
+				SE = sqrt( (double) p * (1 - p) * (N - n) / (double) (n * (N - 1)) );
+				pmin = p - 2 * SE;
+
+				if ((pmin < avg_freq && pmin < hist_width)
+				//~ if ((pmin < avg_freq )
+					 || track[i].count < 10)
 				{
 					num_mcv = i;
 					break;
#2 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#1)
Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

On 22 January 2018 at 08:07, John Naylor <jcnaylor@gmail.com> wrote:

On 1/21/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

If you don't mind, what would the math look like for that?

Using the same syntax as before:

N = Total rows in table (population size)
n = Number of rows sampled
x = Number of times a particular value appears in the sample
p = x/n = Frequency of the value in the sample

The standard error of the sample proportion is then

SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a percentage) is

RSE = 100 * SE / p

So if we were to demand that the relative standard error was less
than, say, 10%, then the constraint would just be

SE < 0.1 * p
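
[Editor's illustration: a rough, self-contained C sketch of just that constraint. This is not the posted patch; the function name keep_as_mcv, the hard-coded 10% bound and the example numbers are only for this sketch.]

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the relative-standard-error test described above.
 * N = total rows in the table, n = rows sampled, x = number of times
 * the candidate value appears in the sample, max_rse = bound on the
 * relative standard error (e.g. 0.1 for 10%).
 */
static bool
keep_as_mcv(double N, double n, double x, double max_rse)
{
	double		p = x / n;						/* sample proportion */
	double		fpc = sqrt((N - n) / (N - 1));	/* finite population correction */
	double		se = sqrt(p * (1.0 - p) / n) * fpc;

	/* same as (se / p) < max_rse, i.e. RSE below the bound */
	return se < max_rse * p;
}

int
main(void)
{
	/* default sample of 30,000 rows from a 10-million-row table */
	printf("x = 108: %s\n", keep_as_mcv(10000000, 30000, 108, 0.1) ? "keep" : "discard");
	printf("x = 46:  %s\n", keep_as_mcv(10000000, 30000, 46, 0.1) ? "keep" : "discard");
	return 0;
}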

Note:

* This formula is not valid if x is very small (see what I wrote about
being able to approximate the distribution of p with a normal
distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the
overall sample size. So the values for N and n above should not be the
counts that exclude NULLs. As far as this logic is concerned, NULLs
(and too-wide values) are just values not equal to the value under
consideration. Thus it appears, from just a quick glance at your
patch, that you're using the wrong values for N and n.

* The RSE is a monotonically decreasing function of p in the range
[0,1], so an upper bound on the RSE equates to a lower bound on the
number of occurrences of the value.

This last point gives me a different idea. Instead of applying this
test *in addition to* the existing tests, how about applying it
*instead of* the existing tests. That is, we keep all MCVs that appear
sufficiently often that we can be reasonably confident in the
estimates produced from their sample frequencies, regardless of
whether or not they are more common than average (the average being
fairly meaningless for a highly skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying
that we keep any value that appears sufficiently often in the sample
that we can extrapolate its sample frequency to produce a reasonably
accurate estimate of the population frequency, and discard values for
which the estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be
worth considering. It has the nice property that the test depends
entirely on how often the value appears, rather than on other
previously computed statistics, such as Ndistinct.
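
[Editor's aside: a rough calibration of what that lower bound works out to, assuming the value is rare relative to the sample (p << 1) and the sample is small relative to the table (n << N), so the finite population correction is close to 1:

RSE = SE / p ~= sqrt(p*(1-p)/n) / p ~= sqrt(1/(n*p)) = 1/sqrt(x)

so a 10% RSE bound corresponds to roughly x > 100 occurrences in the sample, and that count threshold is roughly the same regardless of sample size, which is why a larger sample keeps more values. It matches the cutoff falling between x = 46 and x = 108 in the test below.]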

Doing a quick test in pure SQL, using the highly skewed distribution
Jeff Janes gave in the other thread, with the default sample size of
30,000:

with samples as (
    select floor(-log(random())/log(2))::int as who
    from generate_series(1,30000)
), freqs as (
    select who, count(*) as x, count(*)/30000::float8 as p
    from samples group by who
), stats as (
    select *,
           sqrt(p*(1-p)/30000) * sqrt((10000000-30000)::float8/(10000000-1)) as se
    from freqs
)
select *,
       (10000000*p)::int||'+/-'||(2*se*10000000)::int as "95% interval",
       case when x >= 10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
from stats order by p desc limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |          p           |          se          |  95% interval   |  case
-----+-------+----------------------+----------------------+-----------------+---------
   0 | 15017 |    0.500566666666667 |  0.00288241625942075 | 5005667+/-57648 | KEEP
   1 |  7607 |    0.253566666666667 |  0.00250800597590887 | 2535667+/-50160 | KEEP
   2 |  3713 |    0.123766666666667 |     0.00189844800003 | 1237667+/-37969 | KEEP
   3 |  1855 |   0.0618333333333333 |   0.0013884757600711 |  618333+/-27770 | KEEP
   4 |   914 |   0.0304666666666667 | 0.000990788179299791 |  304667+/-19816 | KEEP
   5 |   448 |   0.0149333333333333 | 0.000699194759916533 |  149333+/-13984 | KEEP
   6 |   229 |  0.00763333333333333 | 0.000501741670388358 |   76333+/-10035 | KEEP
   7 |   108 |               0.0036 | 0.000345267009604061 |    36000+/-6905 | KEEP
   8 |    46 |  0.00153333333333333 | 0.000225565173744715 |    15333+/-4511 | DISCARD
   9 |    34 |  0.00113333333333333 | 0.000193963300230354 |    11333+/-3879 | DISCARD
  10 |    17 | 0.000566666666666667 | 0.000137191663419411 |     5667+/-2744 | DISCARD
  11 |    11 | 0.000366666666666667 | 0.000110367969704201 |     3667+/-2207 | DISCARD
  13 |     1 | 3.33333333333333e-05 | 3.32827427148958e-05 |       333+/-666 | DISCARD
(13 rows)

and if you increase the stats target to 1000 (sample size 300,000)
then it pretty consistently keeps the 11 most common values. So, in
this very limited test at least, it seems to help with Jeff's problem,
and it's no longer the case that cranking up the default statistics
target has a weak effect on the number of MCVs kept.

In the case of a uniform distribution, this might end up keeping all
the values seen, but that's not necessarily wrong -- if they are seen
often enough to make the MCV estimates accurate, why not keep them?

Of course this would need a lot more testing with other data
distributions, and I've not given any thought to the existing
histogram bounds test. Also the 10% RSE bound is fairly arbitrary, and
could be tweaked, but perhaps this is not an entirely crazy idea.

Regards,
Dean