Histogram boundary duplicates bug in HEAD and 8.3
For heavy-tailed distributions, it is possible for analyze to
duplicate histogram boundaries.
Here is the output from a test against HEAD; I've attached the test data.
=# create table bug(f float);
CREATE TABLE
=# copy bug from '/tmp/test_data.txt';
COPY 100000
=# analyze bug;
ANALYZE
=# select histogram_bounds from pg_stats where tablename='bug';
                                histogram_bounds
------------------------------------------------------------------------------
 {34,34,34,34,34,34,34,36,36,36,36,36,37,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,41,41,41,41,42,42,42,43,43,44,44,44,45,45,45,46,46,46,47,47,48,48,48,49,49,50,50,51,51,52,53,53,54,55,55,56,57,58,58,59,60,61,62,63,64,65,67,68,70,72,74,76,79,81,83,86,89,91,94,98,101,105,110,117,122,131,139,152,171,202,236,308,1457}
(1 row)
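The test data itself is in the attachment. For reference, data with a
comparably heavy tail can be generated with something like the
following sketch (the attached file may have been produced differently;
the Pareto parameters here are just a guess that reproduces the general
shape of the histogram above):

/*
 * gen_data.c -- sketch of a heavy-tailed data generator.  Not the
 * actual attachment; it just yields many duplicates at small values
 * and a long, sparse tail via inverse-CDF sampling from a Pareto
 * distribution.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int
main(void)
{
	FILE   *f = fopen("/tmp/test_data.txt", "w");
	int		i;

	if (f == NULL)
		return 1;

	srand(0);
	for (i = 0; i < 100000; i++)
	{
		/* u is uniform on (0, 1); v = xmin / u^(1/alpha) is Pareto */
		double	u = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
		double	v = 34.0 / pow(u, 0.5);		/* xmin = 34, alpha = 2 */

		fprintf(f, "%d\n", (int) v);
	}
	fclose(f);
	return 0;
}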
Analyze assumes that if a value has a sample count less than
samplerows/num_bins (maxmincount, near line 2170 in
commands/analyze.c), then it is safe to include it in the histogram.
However, because the histogram contains only the non-MCV values, this
is incorrect: the threshold is computed against the full sample, while
each histogram bucket covers only a 1/num_bins share of the rows that
remain after the MCVs are removed.
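For concreteness, here is the arithmetic with invented numbers
(samplerows, the MCV coverage, and the counts below are all
hypothetical):

/*
 * Invented numbers illustrating the problem: the 1/K test is applied
 * against the full sample, but the histogram is built only from the
 * rows left over after the MCVs are removed.
 */
#include <stdio.h>

int
main(void)
{
	double	samplerows = 30000;	/* rows in the sample (hypothetical) */
	double	num_bins = 100;		/* stats target (hypothetical) */
	double	mcv_rows = 15000;	/* rows absorbed by the MCV list */
	double	value_count = 290;	/* a value that stays out of the MCVs */

	/* the threshold analyze tests against: samplerows / num_bins = 300 */
	double	maxmincount = samplerows / num_bins;

	/* but each histogram bucket covers only the non-MCV rows: 150 each */
	double	rows_per_bucket = (samplerows - mcv_rows) / num_bins;

	printf("threshold: %.0f, rows per bucket: %.0f\n",
		   maxmincount, rows_per_bucket);
	printf("a value with count %.0f (< %.0f) still spans %.1f buckets\n",
		   value_count, maxmincount, value_count / rows_per_bucket);
	return 0;
}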
As far as I can see, there are 4 solutions:
1) track all of the distinct values
This wouldn't be *too* expensive in analyze, especially considering we
are tracking all of the sampled values as it is. However, it opens up
the possibility of a huge MCV list in the worst case. To see this,
consider a distribution in which the most common value is 20% of the
table, the next most common value is 20% of the remaining entries, and
so on: every distinct value is then 20% of the non-MCV remainder, so
for stats targets greater than 5, every value in the table would
overrun a histogram boundary, leading to an MCV list that contains
every distinct value in the sample.
2) reduce the number of bins if values exist with frequency greater than 1/nbins
This would fix the bug, but at the cost of reducing the utility of the
histogram (it would introduce a large skew into the ndistinct
distribution, which is assumed to be uniform over the non-MCV values).
3) use variable-width histogram bins over all values
This is probably the cleanest solution, but also the most invasive.
4) fix the binary search in ineqsel to correctly find the boundaries,
even with duplicates
This would also be relatively clean (a sketch of such a search follows
this list), but is the assumption that histogram boundaries are
strictly increasing relied on anywhere else besides ineqsel?
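To make 4 concrete, here is a sketch of what a duplicate-tolerant
boundary search could look like (illustrative only -- this is not the
actual ineqsel code):

/*
 * lower_bound returns the first index whose bound is >= key,
 * upper_bound the first index whose bound is > key.  Both are
 * well-defined even when the bounds array contains runs of equal
 * values.
 */
#include <stdio.h>

static int
lower_bound(const double *bounds, int n, double key)
{
	int		lo = 0,
			hi = n;

	while (lo < hi)
	{
		int		mid = lo + (hi - lo) / 2;

		if (bounds[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

static int
upper_bound(const double *bounds, int n, double key)
{
	int		lo = 0,
			hi = n;

	while (lo < hi)
	{
		int		mid = lo + (hi - lo) / 2;

		if (bounds[mid] <= key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

int
main(void)
{
	/* prefix of the histogram_bounds array from the report above */
	double	bounds[] = {34, 34, 34, 34, 34, 34, 34, 36, 36, 36, 36, 36};
	int		n = sizeof(bounds) / sizeof(bounds[0]);

	/* the run of 34s occupies indexes [0, 7): both ends found exactly */
	printf("first bound >= 34 at %d, first bound > 34 at %d\n",
		   lower_bound(bounds, n, 34.0), upper_bound(bounds, n, 34.0));
	return 0;
}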
I've attached a patch that is a compromise between 1 and 2. It puts a
hard limit on the number of MCVs at twice the stats target and then, if
there are still values with too high a frequency, reduces the number of
histogram buckets.
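Concretely, the core of the patch behaves like this sketch (a
simplified rendering of the threshold-shrinking loop with invented
counts; the real loop also honors num_mcv and mincount):

/*
 * Values whose count exceeds the (shrinking) maxmincount threshold are
 * forced into the MCV list; the threshold drops as each MCV removes
 * its rows from the histogram population.
 */
#include <stdio.h>

int
main(void)
{
	double	samplerows = 30000;
	double	num_bins = 100;
	/* hypothetical per-value sample counts, sorted descending */
	double	counts[] = {6000, 4800, 3800, 3100, 2500, 600, 80, 50};
	int		ncounts = sizeof(counts) / sizeof(counts[0]);
	double	maxmincount = samplerows / num_bins;	/* 300 to start */
	int		i;

	for (i = 0; i < ncounts; i++)
	{
		if (counts[i] < maxmincount)
			break;				/* remaining values fit in one bucket */
		printf("take %.0f as an MCV (threshold was %.0f)\n",
			   counts[i], maxmincount);
		maxmincount -= counts[i] / num_bins;
	}
	printf("%d MCVs; final threshold %.0f\n", i, maxmincount);
	return 0;
}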
-Nathan
Attachments:
hist_bndry_bug_fix.patch (text/x-patch)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8536332..7f4e154 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1919,9 +1919,13 @@ compute_scalar_stats(VacAttrStatsP stats,
int num_bins = stats->attr->attstattarget;
StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+ /* set the maximum number of MCVs that we will store */
+ int track_multiplier = 2;
+ int max_num_mcv = num_mcv * track_multiplier + 1;
+
values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
tupnoLink = (int *) palloc(samplerows * sizeof(int));
- track = (ScalarMCVItem *) palloc(num_mcv * sizeof(ScalarMCVItem));
+ track = (ScalarMCVItem *) palloc(max_num_mcv * sizeof(ScalarMCVItem));
SelectSortFunction(mystats->ltopr, false, &cmpFn, &cmpFlags);
fmgr_info(cmpFn, &f_cmpfn);
@@ -2034,7 +2038,7 @@ compute_scalar_stats(VacAttrStatsP stats,
if (dups_cnt > 1)
{
nmultiple++;
- if (track_cnt < num_mcv ||
+ if (track_cnt < max_num_mcv ||
dups_cnt > track[track_cnt - 1].count)
{
/*
@@ -2045,7 +2049,7 @@ compute_scalar_stats(VacAttrStatsP stats,
*/
int j;
- if (track_cnt < num_mcv)
+ if (track_cnt < max_num_mcv)
track_cnt++;
for (j = track_cnt - 1; j > 0; j--)
{
@@ -2164,18 +2168,42 @@ compute_scalar_stats(VacAttrStatsP stats,
mincount = 2;
/* don't let threshold exceed 1/K, however */
maxmincount = (double) samplerows / (double) num_bins;
- if (mincount > maxmincount)
- mincount = maxmincount;
- if (num_mcv > track_cnt)
- num_mcv = track_cnt;
- for (i = 0; i < num_mcv; i++)
+ for (i = 0; i < max_num_mcv - 1 && i < track_cnt; i++)
{
- if (track[i].count < mincount)
- {
- num_mcv = i;
- break;
- }
+ /*
+ * if the count is high enough to overrun a histogram boundary,
+ * keep adding MCVs until we're below the threshold, or until we
+ * have reached the hard limit of max_num_mcv - 1
+ */
+ if (track[i].count >= maxmincount)
+ {
+ /* shrink maxmincount to reflect the rows this MCV removes */
+ maxmincount -= (double) track[i].count / (double) num_bins;
+ continue;
+ }
+
+ if (i >= num_mcv)
+ break;
+
+ if (track[i].count < mincount)
+ break;
}
+
+ /* store the desired number of mcv's as determined by the loop */
+ num_mcv = i;
+
+ /*
+ * finally, make sure that num_bins makes sense. Although we allow
+ * track_multiplier times the desired number of MCV values to be
+ * tracked, for really heavy-tailed distributions it is possible
+ * that we still haven't dropped below the maxmincount threshold.
+ * If the last tracked value is above this threshold, shrink
+ * num_bins so that its count no longer exceeds
+ * (remaining hist values)/num_bins.
+ */
+ if (num_mcv == max_num_mcv - 1 && track[num_mcv].count > maxmincount)
+ {
+ num_bins = (int) (num_bins * (maxmincount / track[num_mcv].count));
+ }
}
/* Generate MCV slot entry */
"Nathan Boley" <npboley@gmail.com> writes:
For heavy tailed distributions, it is possible for analyze to
duplicate histogram boundaries.
I don't think this is a bug. You've got values that didn't make it into
the MCV list, but nonetheless occupy multiple buckets' worth of space in
the remainder of the distribution. They *should* appear multiple times
in the histogram. If they didn't, the histogram would be understating
their frequency.
regards, tom lane
>> For heavy-tailed distributions, it is possible for analyze to
>> duplicate histogram boundaries.
>
> I don't think this is a bug.

hmmm... Well, I assumed it was a bug from a comment in analyze.
From (near) line 2130 in analyze.c:
* least 2 instances in the sample. Also, we won't suppress values
* that have a frequency of at least 1/K where K is the intended
* number of histogram bins; such values might otherwise cause us to
* emit duplicate histogram bin boundaries.
*/
If this is expected, I'm also not sure what the use of maxmincount in
analyze is...
Thanks for the response,
Nathan
"Nathan Boley" <npboley@gmail.com> writes:
I don't think this is a bug.
hmmm... Well, I assumed it was a bug from a comment in analyze.
From ( near ) line 2130 in analyze.c
* least 2 instances in the sample. Also, we won't suppress values
* that have a frequency of at least 1/K where K is the intended
* number of histogram bins; such values might otherwise cause us to
* emit duplicate histogram bin boundaries.
That's talking about a case where we have a choice whether to include a
value in the MCV list or not. Once the MCV list is maxed out, we can't
do anything to avoid duplicates.
regards, tom lane
On Tue, 2009-01-06 at 18:40 -0500, Tom Lane wrote:
> "Nathan Boley" <npboley@gmail.com> writes:
>>> I don't think this is a bug.
>>
>> hmmm... Well, I assumed it was a bug from a comment in analyze.
>> From (near) line 2130 in analyze.c:
>>  * least 2 instances in the sample. Also, we won't suppress values
>>  * that have a frequency of at least 1/K where K is the intended
>>  * number of histogram bins; such values might otherwise cause us to
>>  * emit duplicate histogram bin boundaries.
>
> That's talking about a case where we have a choice whether to include a
> value in the MCV list or not.  Once the MCV list is maxed out, we can't
> do anything to avoid duplicates.

Surely the most important point in the OP was that ineqsel does not
correctly binary search in the presence of duplicates.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
> Surely the most important point in the OP was that ineqsel does not
> correctly binary search in the presence of duplicates.

It would have been, if I were correct :-(.

Looking at it again, that was due to a bug in my own code. Thanks for
your time, and sorry about the noise.
-Nathan