Bad n_distinct estimation; hacks suggested?
Folks,
Params: PostgreSQL 8.0.1 on Solaris 10
Statistics = 500
(tablenames have been changed to protect NDA)
e1=# select tablename, null_frac, correlation, n_distinct from pg_stats where
tablename = 'clickstream1' and attname = 'session_id';
tablename | null_frac | correlation | n_distinct
----------------------+-----------+-------------+------------
clickstream1 | 0 | 0.412034 | 378174
(1 row)
e1=# select count(distinct session_id) from clickstream1;
count
---------
3174813
As you can see, n_distinct estimation is off by a factor of 10x and it's
causing query planning problems. Any suggested hacks to improve the
histogram on this?
(BTW, increasing the stats to 1000 only doubles n_distinct, and doesn't solve
the problem)
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
As you can see, n_distinct estimation is off by a factor of 10x and it's
causing query planning problems. Any suggested hacks to improve the
histogram on this?
What's the histogram itself look like? (I'd like to see the whole
pg_stats row not just part of it ...) There's probably no point in
showing the target=1000 version, but maybe target=100 would be
informative.
regards, tom lane
Tom,
What's the histogram itself look like? (I'd like to see the whole
pg_stats row not just part of it ...) There's probably no point in
showing the target=1000 version, but maybe target=100 would be
informative.
Here is the stats = 100 version. Notice that n_distinct has gone down.
schemaname        | public
tablename         | web_site_activity_fa
attname           | session_id
null_frac         | 0
avg_width         | 8
n_distinct        | 96107
most_common_vals  | {4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815,4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506,71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,70986,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,6239825,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,2546720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,4388025}
most_common_freqs | {0.00166667,0.00146667,0.0013,0.0011,0.000933333,0.0009,0.0008,0.0008,0.000733333,0.000733333,0.0007,0.000633333,0.0006,0.0006,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000533333,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.000466667,0.000466667,0.000433333,0.000433333,0.000433333,0.000433333,0.0004,0.0004,0.0004,0.0004,0.0004,0.000366667,0.000366667,0.000366667,0.000366667,0.000333333,0.000333333,0.000333333,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002}
histogram_bounds  | {230,58907,88648,156764,216759,240405,264601,289047,312630,339947,364452,386486,409427,434075,455140,475759,500086,521530,544703,680376,981066,1313419,1712592,1860151,1882452,1905328,1927504,1948159,1970054,1990408,2014501,2038573,2062786,2087163,2110129,2132196,2155657,2181058,2204976,2228575,2256229,2283897,2352453,2407153,2457716,2542081,2572119,2624133,2699592,2771254,2832224,2908151,2951500,3005088,3032889,3137244,3158685,3179395,3203681,3261587,3304359,3325577,3566688,3621357,3645094,3718667,3740821,3762386,3783169,3804593,3826503,3904589,3931012,3957675,4141934,4265118,4288568,4316898,4365625,4473965,4535752,4559700,4691802,4749478,5977208,6000272,6021416,6045939,6078912,6111900,6145155,6176422,6206627,6238291,6271270,6303067,6334117,6365200,6395250,6424719,6888329}
correlation       | 0.41744
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Tom,
Any thoughts? This is really messing up query execution all across the
database ...
--Josh
Here is the stats = 100 version. Notice that n_distinct has gone down.
[pg_stats row snipped; quoted in full above]
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Hi.
Sometimes, if the random number generator, that PostgreSQL uses,
isn't good enough, the randomly selected pages for the statistics
might not be random enough.
Solaris is unknown to me. Maybe the used random number generator there
isn't good enough?
Good statistics depend on good random numbers.
So, for example, if you have one million pages but the upper bound of the
generated random numbers is only one hundred thousand pages, the statistics
will be skewed. Or a random number generator might produce only 32000
different values.
Regards,
Marko Ristola
Josh Berkus wrote:
Any thoughts? This is really messing up query execution all across the
database ...
[quoted pg_stats row snipped]
Marko,
Sometimes, if the random number generator, that PostgreSQL uses,
isn't good enough, the randomly selected pages for the statistics
might not be random enough.
Solaris is unknown to me. Maybe the used random number generator there
isn't good enough?
Hmmm. Good point. Will have to test on Linux.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Solaris is unknown to me. Maybe the used random number generator there
isn't good enough?
Hmmm. Good point. Will have to test on Linux.
Nope:
Linux 2.4.20:
test=# select tablename, attname, n_distinct from pg_stats where tablename =
'web_site_activity_fa';
tablename | attname | n_distinct
----------------------+---------------------+------------
web_site_activity_fa | session_id | 626127
test=# select count(distinct session_id) from web_site_activity_fa;
count
---------
3174813
(1 row)
... I think the problem is in our heuristic sampling code. I'm not the first
person to have this kind of a problem. Will be following up with tests ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
... I think the problem is in our heuristic sampling code. I'm not the first
person to have this kind of a problem. Will be following up with tests ...
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant-sized samples of data for n_distinct is bogus. Constant-sized
samples only work for things like histograms, which can be analyzed with
standard population-sampling statistics that depend on the law of large
numbers.
n_distinct requires you to estimate how frequently very rare things occur. You
can't apply the law of large numbers because even a single instance of a value
out of a large pool alters the results disproportionately.
To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't practical.
Moreover, I think the percentage would have to be quite large. Even if you
sampled half the table I think the confidence interval would be quite wide.
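Greg's point is easy to see with a quick simulation (a throwaway Python sketch; the population sizes here are invented for illustration): two populations with a 4x difference in distinct values produce nearly identical fixed-size samples, because rare duplicates almost never show up in the sample.

```python
import random

random.seed(42)

N = 100_000   # population size
n = 500       # fixed-size sample, as ANALYZE effectively uses

# Population A: 50,000 distinct values, each appearing twice.
pop_a = [i // 2 for i in range(N)]
# Population B: 12,500 distinct values, each appearing eight times.
pop_b = [i // 8 for i in range(N)]

sample_a = random.sample(pop_a, n)
sample_b = random.sample(pop_b, n)

# Both samples consist almost entirely of values seen once, so the
# sample alone cannot distinguish D = 50,000 from D = 12,500.
print(len(set(sample_a)), len(set(sample_b)))
```

Both counts come out within a few percent of n, which is exactly the "single instance of a value out of a large pool" problem described above.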
The situation is worsened because it's unclear how to interpolate values for
subsets of the table. If the histogram says you have a million records and
you're adding a clause that has a selectivity of 50% then half a million is a
good guess. But if what you care about is n_distinct and you start with a
million records with 1,000 distinct values and then apply a clause that
filters 50% of them, how do you estimate the resulting n_distinct? 500 is
clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0
to 1,000 and you have no good way to figure out where the truth lies.
So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on good
estimates for n_distinct.
--
greg
Greg,
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
through standard statistics population sampling which depends on the law of
large numbers.
Well, unusual distributions are certainly tough. But I think the problem
exists even for relatively well-distributed populations. Part of it is, I
believe, the formula we are using:
n*d / (n - f1 + f1*n/N)
This is an estimation formula from Haas and Stokes in IBM Research Report RJ
10025, and is called the DUJ1 formula.
(http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf) It appears to
suck. For example, take my table:
rows: 26million (N)
distinct values: 3.4million
I took a random sample of 1000 rows (n) from that table. It contained:
968 values that occurred only once (f1)
981 distinct values (d)
Any human being looking at that sample would assume a large number of distinct
values; after all, 96.8% of the values occurred only once. But the formula
gives us:
1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620
This is obviously dramatically wrong, by a factor of 100. The math gets worse
as the sample size goes down:
Sample 250, 248 distinct values, 246 unique values:
250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490
Even in a case with an overwhelming majority of unique values, the formula gets
it wrong:
999 unique values in sample
998 appearing only once:
1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093
This means that, with a sample size of 1000, this formula can never estimate
more than about half a million distinct values for a 26-million-row table,
unless the column is unique.
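For anyone who wants to reproduce these numbers, here is the estimator as a throwaway Python function (just the formula quoted above, nothing more; the sample figures are the ones from this message):

```python
def duj1(n, d, f1, N):
    """Duj1 estimator as used in analyze.c (per Haas & Stokes, RJ 10025).

    n  -- sample size
    d  -- distinct values observed in the sample
    f1 -- values observed exactly once in the sample
    N  -- total rows in the table
    """
    return (n * d) / (n - f1 + f1 * n / N)

# 26M-row table, real distinct count ~3.4M:
print(duj1(1000, 981, 968, 26_000_000))   # dramatically low
print(duj1(250, 248, 246, 26_000_000))
print(duj1(1000, 999, 998, 26_000_000))   # near the half-million ceiling
# 969,000-row table, real distinct count 374,000:
print(duj1(1000, 938, 918, 969_000))
```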
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
This formula appears broken everywhere:
Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918
1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308
Again, too small by a factor of 20x.
This is so broken, in fact, that I'm wondering if we've read the paper right?
I've perused the paper on almaden, and the DUJ1 formula appears considerably
more complex than the formula we're using.
Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see if
we are correctly interpreting it? I'm particularly confused as to what "q"
and "d-sub-n" represent. Thanks!
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus said:
Well, unusual distributions are certainly tough. But I think the
problem exists even for relatively well-distributed populations.
Part of it is, I believe, the formula we are using:
n*d / (n - f1 + f1*n/N)
[snip]
This is so broken, in fact, that I'm wondering if we've read the paper
right? I've perused the paper on almaden, and the DUJ1 formula
appears considerably more complex than the formula we're using.
Can someone whose math is more recent than calculus in 1989 take a look
at that paper, and look at the formula toward the bottom of page 10,
and see if we are correctly interpreting it? I'm particularly
confused as to what "q" and "d-sub-n" represent. Thanks!
Math not too recent ...
I quickly read the paper and independently came up with the same formula you
say we are applying. The formula is on the page that is numbered 6,
although it's the tenth page in the PDF.
q = n/N = ratio of sample size to population size
d_sub_n = d = number of distinct classes in sample
cheers
andrew
People,
Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see
if we are correctly interpreting it? I'm particularly confused as to
what "q" and "d-sub-n" represent. Thanks!
Actually, I managed to solve for these and it appears we are using the formula
correctly. It's just a bad formula.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not. Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.
regards, tom lane
Here is my opinion.
I hope this helps.
Maybe there is no one good formula:
On boolean type, there are at most 3 distinct values.
There is an upper bound for forenames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one country.
On the other hand, with timestamp, every value could be distinct.
A primary key with only one column has only distinct values.
If an integer column references another table's primary key via a
foreign key, we could take advantage of that knowledge.
A column with a unique index has only distinct values.
The first group is for classification; the second measures continuous
or discrete quantities such as time.
The upper bound for classifying might be 3 (boolean), or it might be
one million. The properties of the distribution might be hard to guess.
Here is one way:
1. Find out the number of distinct values for 500 rows.
2. Try to guess, how many distinct values are for 1000 rows.
Find out the real number of distinct values for 1000 rows.
3. If the guess is more than 50% off from reality, do the iteration for
2x1000 rows.
Iterate using a power of two to increase the samples, until you trust the
estimate enough.
So, in the phase two, you could try to guess with two distinct formulas:
One for the classifying target (boolean columns hit there).
Another one for the timestamp and numerical values.
If there are one million classifications in one column, how can you
find that out other than by checking at least two million
rows?
This means, that the user should have a possibility to tell the lower
bound for the number of rows for sampling.
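A rough Python sketch of that doubling iteration (the stopping threshold, sample sizes, and the fresh-sample-per-round shortcut are all hypothetical, not a tested proposal):

```python
import random

def adaptive_distinct(rows, start=1000, settle=1.1):
    """Sketch of Marko's idea: keep doubling the sample until the
    observed distinct count stops growing much, then trust it.
    `start` and `settle` are hypothetical tuning knobs; a real
    implementation would extend one sample rather than redraw."""
    n = min(start, len(rows))
    d = len(set(random.sample(rows, n)))
    while n < len(rows):
        n = min(2 * n, len(rows))
        d2 = len(set(random.sample(rows, n)))
        if d2 < settle * d:   # growth has flattened out; sample saturates D
            return d2
        d = d2
    return d

random.seed(7)
rows = [random.randrange(5000) for _ in range(100_000)]
est = adaptive_distinct(rows)
print(est)
```

Note that on this data the loop only settles after sampling roughly a third of the "table", which is Marko's point: finding a million classifications requires looking at a large multiple of a million rows.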
Regards,
Marko Ristola
Tom Lane wrote:
[quoted message snipped]
Tom Lane wrote:
[quoted message snipped]
The math in the paper does not seem to look at very low levels of q (=
sample to pop ratio).
The formula has a range of [d,N]. It appears intuitively (i.e. I have
not done any analysis) that at very low levels of q, as f1 moves down
from n, the formula moves down from N towards d very rapidly. I did a
test based on the l_comments field in a TPC lineitems table. The test
set has N = 6001215, D = 2921877. In my random sample of 1000 I got d =
976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2
orders of magnitude.
I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)
cheers
andrew
Andrew,
The math in the paper does not seem to look at very low levels of q (=
sample to pop ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable) with a sample size of 25% or more (which
is infeasible in any large table). The formula does work for populations
where D/N is much lower, say 0.01. So overall it seems to only work for 1/4
of cases; those where n/N is large and D/N is low. And, annoyingly, that's
probably the population where accurate estimation is least crucial, as it
consists mostly of small tables.
I've just developed (not original, probably, but original to *me*) a formula
that works on populations where n/N is very small and D/N is moderate (i.e.
0.1 to 0.4):
N * (d/n)^(sqrt(N/n))
However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4)
populations, and only 3 of them to boot. I'd appreciate other people trying
it on their own data populations, particularly very different ones, like D/N
> 0.7 or D/N < 0.01.
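In Python, the proposed estimator is a one-liner; treat this strictly as the experimental formula above, not an endorsed method (the test values are the 969,000-row table from earlier in the thread, where DUJ1 gave 11,308 against a true 374,000):

```python
def josh_estimate(n, d, N):
    """Josh's experimental estimator: N * (d/n) ** sqrt(N/n).

    n -- sample size, d -- distinct values in sample, N -- table rows.
    Claimed to behave only where n/N is very small and D/N is 0.1-0.4.
    """
    return N * (d / n) ** ((N / n) ** 0.5)

# 969,000-row table, 374,000 true distinct values, sample of 1000
# with 938 distinct: lands within a few x of the truth.
print(josh_estimate(1000, 938, 969_000))
```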
Further, as Andrew points out we presumably do page sampling rather than
purely random sampling so I should probably read the paper he referenced.
Working on it now ....
--
Josh Berkus
Aglio Database Solutions
San Francisco
Folks,
I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)
Actually, that paper looks *really* promising. Does anyone here have enough
math to solve for D(sub)Md on page 6? I'd like to test it on samples of <
0.01%.
Tom, how does our heuristic sampling work? Is it pure random sampling, or
page sampling?
--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Tom, how does our heuristic sampling work? Is it pure random sampling, or
page sampling?
Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the table.
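For reference, the classic way to draw a fixed-size uniform sample in one pass is Vitter's reservoir algorithm, which (if memory serves) is what ANALYZE's row selection is built on. A minimal Python sketch:

```python
import random

def reservoir_sample(rows, k):
    """Vitter's Algorithm R: one pass over `rows`, uniform k-element
    sample, without knowing the total row count up front."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Keep the incoming row with probability k / (i + 1),
            # evicting a random current occupant.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = row
    return reservoir

random.seed(0)
sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # always exactly 100
```

The sample is uniform over rows, not pages; the page-level shortcut is what makes the real thing only an approximation of pure random sampling.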
regards, tom lane
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
Greg Stark wrote
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
through standard statistics population sampling which depends on the law of
large numbers.
ISTM Greg's comments are correct. There is no way to calculate this with
consistent accuracy when using a constant sized sample. (If it were,
then people wouldn't bother to hold large databases...)
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
The only information you can determine from a sample is the smallest
number of distinct values that would reasonably produce the given
sample. There is no meaningful concept of a median one... (You do have
an upper bound: the number of rows in the table, but I cannot see any
meaning from taking (Nrows+estimatedN_distinct)/2 ).
Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters for the distribution.
Most other RDBMS make optimizer statistics collection an unsampled
scan. Some offer this as one of their options, as well as the ability to
define the sample size in terms of fixed number of rows or fixed
proportion of the table.
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately. Then
anybody who doesn't like the estimated statistics has a clear path to
take.
The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
comparisons to ensure an upper bound of join selectivity is known. If
the statistics derived are themselves inaccurate the error propagation
touches every other calculation in the optimizer. GIGO.
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed ?
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately.
...
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed ?
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
regards, tom lane
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, April 25, 2005 10:23 AM
To: Simon Riggs
Cc: josh@agliodbs.com; Greg Stark; Marko Ristola; pgsql-perform;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
How about an option to create a stat hashmap for the column
that maps distinct values to their number of occurrences? Obviously
the map would need to be updated on INSERT/DELETE/UPDATE, but if the
table is dominated by reads, and an accurate n_distinct is very
important, there may be people willing to pay the extra time and space
cost.
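A toy Python sketch of such a maintained statistic (the class and method names are hypothetical; the hard part in practice would be the write-path and crash-safety overhead, which this ignores):

```python
from collections import Counter

class DistinctTracker:
    """Hypothetical incrementally-maintained n_distinct statistic:
    a map from value -> occurrence count, updated on every write."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, value):
        self.counts[value] += 1

    def delete(self, value):
        self.counts[value] -= 1
        if self.counts[value] <= 0:
            del self.counts[value]       # last occurrence gone

    def update(self, old, new):
        self.delete(old)
        self.insert(new)

    @property
    def n_distinct(self):
        return len(self.counts)          # exact, no sampling involved

t = DistinctTracker()
for v in [1, 1, 2, 3]:
    t.insert(v)
t.delete(3)
t.update(2, 9)
print(t.n_distinct)  # 2  (remaining values: 1 and 9)
```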
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129