Bad n_distinct estimation; hacks suggested?
Folks,
Params: PostgreSQL 8.0.1 on Solaris 10
Statistics = 500
(tablenames have been changed to protect NDA)
e1=# select tablename, null_frac, correlation, n_distinct from pg_stats where
tablename = 'clickstream1' and attname = 'session_id';
tablename | null_frac | correlation | n_distinct
----------------------+-----------+-------------+------------
clickstream1 | 0 | 0.412034 | 378174
(2 rows)
e1=# select count(distinct session_id) from clickstream1;
count
---------
3174813
As you can see, n_distinct estimation is off by a factor of 10x and it's
causing query planning problems. Any suggested hacks to improve the
histogram on this?
(BTW, increasing the stats to 1000 only doubles n_distinct, and doesn't solve
the problem)
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
As you can see, n_distinct estimation is off by a factor of 10x and it's
causing query planning problems. Any suggested hacks to improve the
histogram on this?
What's the histogram itself look like? (I'd like to see the whole
pg_stats row not just part of it ...) There's probably no point in
showing the target=1000 version, but maybe target=100 would be
informative.
regards, tom lane
Tom,
What's the histogram itself look like? (I'd like to see the whole
pg_stats row not just part of it ...) There's probably no point in
showing the target=1000 version, but maybe target=100 would be
informative.
Here is the stats = 100 version. Notice that n_distinct has gone down.
schemaname        | public
tablename         | web_site_activity_fa
attname           | session_id
null_frac         | 0
avg_width         | 8
n_distinct        | 96107
most_common_vals  |
{4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815,4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506,71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,70986,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,6239825,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,2546720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,4388025}
most_common_freqs |
{0.00166667,0.00146667,0.0013,0.0011,0.000933333,0.0009,0.0008,0.0008,0.000733333,0.000733333,0.0007,0.000633333,0.0006,0.0006,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000533333,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.000466667,0.000466667,0.000433333,0.000433333,0.000433333,0.000433333,0.0004,0.0004,0.0004,0.0004,0.0004,0.000366667,0.000366667,0.000366667,0.000366667,0.000333333,0.000333333,0.000333333,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002}
histogram_bounds  |
{230,58907,88648,156764,216759,240405,264601,289047,312630,339947,364452,386486,409427,434075,455140,475759,500086,521530,544703,680376,981066,1313419,1712592,1860151,1882452,1905328,1927504,1948159,1970054,1990408,2014501,2038573,2062786,2087163,2110129,2132196,2155657,2181058,2204976,2228575,2256229,2283897,2352453,2407153,2457716,2542081,2572119,2624133,2699592,2771254,2832224,2908151,2951500,3005088,3032889,3137244,3158685,3179395,3203681,3261587,3304359,3325577,3566688,3621357,3645094,3718667,3740821,3762386,3783169,3804593,3826503,3904589,3931012,3957675,4141934,4265118,4288568,4316898,4365625,4473965,4535752,4559700,4691802,4749478,5977208,6000272,6021416,6045939,6078912,6111900,6145155,6176422,6206627,6238291,6271270,6303067,6334117,6365200,6395250,6424719,6888329}
correlation       | 0.41744
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Tom,
Any thoughts? This is really messing up query execution all across the
database ...
--Josh
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Hi.
If the random number generator that PostgreSQL uses isn't good enough,
the pages selected for the statistics sample might not be random enough.
I'm not familiar with Solaris; maybe the random number generator used
there isn't good enough?
Good statistics depend on good random numbers.
So, for example, if you have one million pages but the upper bound of the
generated random numbers is one hundred thousand pages, the statistics
will be skewed.
Or a random number generator might produce, for example, only 32000
different values.
Regards,
Marko Ristola
Josh Berkus wrote:
Tom,
Any thoughts? This is really messing up query execution all across the
database ...
--Josh
Marko,
If the random number generator that PostgreSQL uses isn't good enough,
the pages selected for the statistics sample might not be random enough.
I'm not familiar with Solaris; maybe the random number generator used
there isn't good enough?
Hmmm. Good point. Will have to test on Linux.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
I'm not familiar with Solaris; maybe the random number generator used
there isn't good enough?
Hmmm. Good point. Will have to test on Linux.
Nope:
Linux 2.4.20:
test=# select tablename, attname, n_distinct from pg_stats where tablename =
'web_site_activity_fa';
tablename | attname | n_distinct
----------------------+---------------------+------------
web_site_activity_fa | session_id | 626127
test=# select count(distinct session_id) from web_site_activity_fa;
count
---------
3174813
(1 row)
... I think the problem is in our heuristic sampling code. I'm not the first
person to have this kind of a problem. Will be following up with tests ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
... I think the problem is in our heuristic sampling code. I'm not the first
person to have this kind of a problem. Will be following up with tests ...
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized samples
only work for things like the histograms that can be analyzed through standard
statistics population sampling which depends on the law of large numbers.
n_distinct requires you to estimate how frequently very rare things occur. You
can't apply the law of large numbers because even a single instance of a value
out of a large pool alters the results disproportionately.
To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't practical.
Moreover, I think the percentage would have to be quite large. Even if you
sampled half the table I think the confidence interval would be quite wide.
The situation is worsened because it's unclear how to interpolate values for
subsets of the table. If the histogram says you have a million records and
you're adding a clause that has a selectivity of 50% then half a million is a
good guess. But if what you care about is n_distinct and you start with a
million records with 1,000 distinct values and then apply a clause that
filters 50% of them, how do you estimate the resulting n_distinct? 500 is
clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0
to 1,000 and you have no good way to figure out where the truth lies.
So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on good
estimates for n_distinct.
--
greg
Greg,
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
through standard statistics population sampling which depends on the law of
large numbers.
Well, unusual distributions are certainly tough. But I think the problem
exists even for relatively well-distributed populations. Part of it is, I
believe, the formula we are using:
n*d / (n - f1 + f1*n/N)
This is an estimation formula from Haas and Stokes in IBM Research Report RJ
10025, and is called the DUJ1 formula.
(http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf) It appears to
suck. For example, take my table:
rows: 26 million (N)
distinct values: 3.4 million
I took a random sample of 1000 rows (n) from that table. It contained:
968 values that occurred only once (f1)
981 distinct values (d)
Any human being looking at that sample would assume a large number of distinct
values; after all, 96.8% of the values occurred only once. But the formula
gives us:
1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620
This is obviously dramatically wrong, by a factor of 100. The math gets worse
as the sample size goes down:
Sample 250, 248 distinct values, 246 unique values:
250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490
Even in a case with an overwhelming majority of unique values, the formula gets
it wrong:
999 unique values in sample
998 appearing only once:
1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093
This means that, with a sample size of 1000, this formula can never estimate a
26-million-row table to have more than about half a million distinct values,
unless the column is a unique column.
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
This formula appears broken everywhere:
Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918
1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308
Again, too small by a factor of 20x.
This is so broken, in fact, that I'm wondering if we've read the paper right?
I've perused the paper on almaden, and the DUJ1 formula appears considerably
more complex than the formula we're using.
Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see if
we are correctly interpreting it? I'm particularly confused as to what "q"
and "d-sub-n" represent. Thanks!
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus said:
Well, unusual distributions are certainly tough. But I think the
problem exists even for relatively well-distributed populations.
Part of it is, I believe, the formula we are using:
n*d / (n - f1 + f1*n/N)
[snip]
This is so broken, in fact, that I'm wondering if we've read the paper
right? I've perused the paper on almaden, and the DUJ1 formula
appears considerably more complex than the formula we're using.
Can someone whose math is more recent than calculus in 1989 take a look
at that paper, and look at the formula toward the bottom of page 10,
and see if we are correctly interpreting it? I'm particularly
confused as to what "q" and "d-sub-n" represent. Thanks!
Math not too recent ...
I quickly read the paper and independently came up with the same formula you
say above we are applying. The formula is on the page that is numbered 6,
although it's the tenth page in the PDF.
q = n/N = ratio of sample size to population size
d_sub_n = d = number of distinct classes in sample
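(For what it's worth, the algebra checks out: writing q = n/N, the form
d / (1 - (1 - q)*f1/n) multiplies out to n*d / (n - f1 + f1*n/N), which is
exactly the expression Josh quoted, so the two readings do agree.)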
cheers
andrew
People,
Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see
if we are correctly interpreting it? I'm particularly confused as to
what "q" and "d-sub-n" represent. Thanks!
Actually, I managed to solve for these and it appears we are using the formula
correctly. It's just a bad formula.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not. Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.
regards, tom lane
Here is my opinion. I hope this helps.
Maybe there is no single good formula:
A boolean column has at most three distinct values.
There is an upper bound on the number of forenames in one country.
There is an upper bound on the number of last names in one country.
There is a fixed number of states and postal codes in one country.
On the other hand, with a timestamp column every value could be distinct.
A single-column primary key has only distinct values.
If an integer column references another table's primary key through a
foreign key, we could take advantage of that knowledge.
A column with a unique index has only distinct values.
The first group of columns classifies things; the second group measures
continuous or discrete quantities such as time.
The upper bound for a classifying column might be 3 (boolean) or it might be
one million, and the properties of the distribution might be hard to guess.
Here is one way (sketched in code below):
1. Find the number of distinct values in a 500-row sample.
2. Guess how many distinct values a 1000-row sample should contain, then
find the real number of distinct values in a 1000-row sample.
3. If the guess is off from reality by 50% or more, repeat the process for
2x1000 rows.
Keep doubling the sample size this way until you trust the estimate enough.
In phase two you could try to guess with two distinct formulas:
one for classifying targets (boolean columns fall in here), and
another for timestamp and numerical values.
If one column holds a million distinct classifications, how can you
discover that by any means other than checking at least two million rows?
This means the user should have a way to specify a lower bound on the
number of rows to sample.
Regards,
Marko Ristola
Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not. Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.
The math in the paper does not seem to look at very low levels of q (=
sample to pop ratio).
The formula has a range of [d,N]. It appears intuitively (i.e. I have
not done any analysis) that at very low levels of q, as f1 moves down
from n, the formula moves down from N towards d very rapidly. I did a
test based on the l_comments field in a TPC lineitems table. The test
set has N = 6001215, D = 2921877. In my random sample of 1000 I got d =
976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2
orders of magnitude.
I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)
cheers
andrew
Andrew,
The math in the paper does not seem to look at very low levels of q (=
sample to pop ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable) with a sample size of 25% or more (which
is infeasible in any large table). The formula does work for populations
where D/N is much lower, say 0.01. So overall it seems to only work for 1/4
of cases; those where n/N is large and D/N is low. And, annoyingly, that's
probably the population where accurate estimation is least crucial, as it
consists mostly of small tables.
I've just developed (not original, probably, but original to *me*) a formula
that works on populations where n/N is very small and D/N is moderate (i.e.
0.1 to 0.4):
N * (d/n)^(sqrt(N/n))
However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4)
populations, and only 3 of them to boot. I'd appreciate other people trying
it on their own data populations, particularly very different ones, like D/N
> 0.7 or D/N < 0.01.
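For anyone who wants to try it without retyping, a minimal sketch of that
formula (Python; the function name is mine):

from math import sqrt

def proposed_estimate(N, n, d):
    """Josh's proposed estimator: N * (d/n) ** sqrt(N/n).
    N = rows in the table, n = rows sampled, d = distinct values in the sample."""
    return N * (d / n) ** sqrt(N / n)

# e.g. the 26-million-row sample from earlier in the thread (n = 1000, d = 981)
# comes out around 1.2 million, versus roughly 30 thousand from the other formula.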
Further, as Andrew points out we presumably do page sampling rather than
purely random sampling so I should probably read the paper he referenced.
Working on it now ....
--
Josh Berkus
Aglio Database Solutions
San Francisco
Folks,
I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)
Actually, that paper looks *really* promising. Does anyone here have enough
math to solve for D(sub)Md on page 6? I'd like to test it on samples of <
0.01%.
Tom, how does our heuristic sampling work? Is it pure random sampling, or
page sampling?
--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Tom, how does our heuristic sampling work? Is it pure random sampling, or
page sampling?
Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the table.
regards, tom lane
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
Greg Stark wrote:
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
through standard statistics population sampling which depends on the law of
large numbers.
ISTM Greg's comments are correct. There is no way to calculate this with
consistent accuracy when using a constant-sized sample. (If there were,
then people wouldn't bother to hold large databases...)
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
The only information you can determine from a sample is the smallest
number of distinct values that would reasonably produce the given
sample. There is no meaningful concept of a median one... (You do have
an upper bound: the number of rows in the table, but I cannot see any
meaning from taking (Nrows+estimatedN_distinct)/2 ).
Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters for the distribution.
Most other RDBMS make optimizer statistics collection an unsampled
scan. Some offer this as one of their options, as well as the ability to
define the sample size in terms of fixed number of rows or fixed
proportion of the table.
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately. Then
anybody who doesn't like the estimated statistics has a clear path to
take.
The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
comparisons to ensure an upper bound of join selectivity is known. If
the statistics derived are themselves inaccurate the error propagation
touches every other calculation in the optimizer. GIGO.
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed ?
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately.
...
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed ?
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
regards, tom lane
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, April 25, 2005 10:23 AM
To: Simon Riggs
Cc: josh@agliodbs.com; Greg Stark; Marko Ristola; pgsql-perform;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
How about an option to create a stat hashmap for the column
that maps distinct values to their number of occurrences? Obviously
the map would need to be updated on INSERT/DELETE/UPDATE, but if the
table is dominated by reads, and an accurate n_distinct is very
important, there may be people willing to pay the extra time and space
cost.
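A toy sketch of such a map (Python; the trigger or backend plumbing that
would keep it current on writes is left entirely open):

from collections import defaultdict

class DistinctTracker:
    """Exact n_distinct for one column, maintained on every write: the map
    records how many rows currently hold each value."""

    def __init__(self):
        self.counts = defaultdict(int)

    def insert(self, value):
        self.counts[value] += 1

    def delete(self, value):
        self.counts[value] -= 1
        if self.counts[value] <= 0:
            del self.counts[value]

    def update(self, old, new):
        self.delete(old)
        self.insert(new)

    def n_distinct(self):
        return len(self.counts)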
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately.
...
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed ?
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
Assuming you don't use the HashAgg plan, which seems very appropriate
for the task? (...but I understand the plan otherwise).
If that was the issue, then why not keep scanning until you've used up
maintenance_work_mem with hash buckets, then stop and report the result.
The problem is if you don't do the sort once for statistics collection
you might accidentally choose plans that force sorts on that table. I'd
rather do it once...
The other alternative is to allow an ALTER TABLE command to set
statistics manually, but I think I can guess what you'll say to that!
Best Regards, Simon Riggs
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of data pages, rather than a pure random sample
of rows, makes this more feasible; for example, a 70GB table would only need to sample
about 9000 data pages (or 70MB). Of course, larger samples would lead to
better accuracy, and this could be set through a revised GUC (i.e.,
maximum_sample_size, minimum_sample_size).
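The page arithmetic behind that 70GB figure, assuming the default 8KB block
size (a sketch, not the actual ANALYZE code):

BLCKSZ = 8192                              # assumed default page size

table_bytes  = 70 * 1024**3                # 70GB table
total_pages  = table_bytes // BLCKSZ       # about 9.2 million pages
sample_pages = total_pages // 1000         # 0.1% of pages: about 9175
sample_bytes = sample_pages * BLCKSZ       # about 72MB to read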
I just need a little help doing the math ... please?
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Guys,
While it's not possible to get accurate estimates from a fixed size sample,
I think it would be possible from a small but scalable sample: say, 0.1% of
all data pages on large tables, up to the limit of maintenance_work_mem.
BTW, when I say "accurate estimates" here, I'm talking about "accurate enough
for planner purposes" which in my experience is a range between 0.2x to 5x.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of data pages, rather than a pure random sample
of rows, makes this more feasible; for example, a 70GB table would only need to
sample about 9000 data pages (or 70MB). Of course, larger samples would lead to
better accuracy, and this could be set through a revised GUC (i.e.,
maximum_sample_size, minimum_sample_size).
I just need a little help doing the math ... please?
After some more experimentation, I'm wondering about some sort of
adaptive algorithm, a bit along the lines suggested by Marko Ristola,
but limited to 2 rounds.
The idea would be that we take a sample (either of fixed size, or some
small proportion of the table), see how well it fits a larger sample
(say a few times the size of the first sample), and then adjust the
formula accordingly to project from the larger sample the estimate for
the full population. Math not worked out yet - I think we want to ensure
that the result remains bounded by [d,N].
cheers
andrew
-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Sunday, April 24, 2005 2:08 PM
To: Andrew Dunstan
Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
Actually, that paper looks *really* promising. Does anyone here have
enough math to solve for D(sub)Md on page 6? I'd like to test it on
samples of < 0.01%.
[...]
D_Md = [1 - sqrt(f~_1 / s)] D_b + sqrt(f~_1 / s) D_B
s = block size
f~_1 = median frequency within blocks for distinct values occurring in
only one block
D_b = d + f_1^(b+1)
d = distinct classes in the sample
f_1^(b+1) = number of distinct values occurring in a single block in
a sample of b+1 blocks
D_B = d + [B / (b + 1)] f_1^(b+1)
b = sample size (in blocks)
B = total table size (in blocks)
f_k and f~_k are the only tricky functions here, but they are easy to
understand:
Suppose our column contains values from the set {a, b, c, ..., z}.
Suppose we have a sample of b = 10 blocks.
Suppose that the value 'c' occurs in exactly 3 blocks (we don't care
how often it occurs *within* those blocks).
Suppose that the value 'f' also occurs in exactly 3 blocks.
Suppose that the values 'h', 'p', and 'r' occur in exactly 3 blocks.
Suppose that no other value occurs in exactly 3 blocks.
f_3^b = 5
This is because there are 5 distinct values that occur in exactly
3 blocks. f_1^b is the number of distinct values that occur in
exactly 1 block, regardless of how often it occurs within that block.
Note that when you select a sample size of b blocks, you actually
need to sample b+1 blocks to compute f_1^(b+1). This is actually
pedantry since all occurrences of b in the formula are really b+1.
f~ is slightly trickier. First, we pick the distinct values that
occur in only one block. Then, we count how often each value
occurs within its block. To wit:
Suppose we have a set {d, q, y, z} of values that occur in only
one block.
Suppose that d occurs 3x, q occurs 1x, y occurs 8x, and z occurs 6x.
The function f- would take the mean of these counts to determine
the "cluster frequency". So f- here would be 4.5. This allows
one to compute D_MF.
The function f~ takes the median of this sample, which is 3 or 6
(or I suppose you could even take the mean of the two medians if
you wanted).
No tricky math involved. That should be enough to tell you how to
write the estimator.
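Putting those definitions together, here is one possible reading in code
(Python; it follows the formulas exactly as written above, including the
median-based f~_1 weight, so treat it as a sketch of this interpretation
rather than the paper's own algorithm):

from collections import Counter, defaultdict
from math import sqrt
from statistics import median

def d_md(blocks, B, s):
    """Block-level estimator D_Md, per the definitions above.
    blocks = the sampled blocks (b+1 of them), each a list of column values
    B      = total number of blocks in the table
    s      = rows per block"""
    per_block = [Counter(block) for block in blocks]

    appears_in = defaultdict(int)     # value -> number of sampled blocks containing it
    for counts in per_block:
        for value in counts:
            appears_in[value] += 1

    d = len(appears_in)                                      # distinct values in the sample
    singles = [v for v, k in appears_in.items() if k == 1]   # values seen in exactly one block
    f1 = len(singles)                                        # f_1^(b+1)

    # f~_1: median within-block frequency of the single-block values
    f1_med = median(sum(c[v] for c in per_block) for v in singles) if singles else 0

    D_b = d + f1
    D_B = d + (B / len(blocks)) * f1
    w = sqrt(f1_med / s)
    return (1 - w) * D_b + w * D_B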
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
Simon Riggs <simon@2ndquadrant.com> writes:
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
Assuming you don't use the HashAgg plan, which seems very appropriate
for the task? (...but I understand the plan otherwise).
The context here is a case with a very large number of distinct
values... keep in mind also that we have to do this for *all* the
columns of the table. A full-table scan for each column seems
right out to me.
regards, tom lane
-----Original Message-----
From: Andrew Dunstan [mailto:andrew@dunslane.net]
Sent: Monday, April 25, 2005 3:43 PM
To: josh@agliodbs.com
Cc: pgsql-perform; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
Josh Berkus wrote:
Simon, Tom:
While it's not possible to get accurate estimates from a
fixed size sample, I think it would be possible from a
small but scalable sample: say, 0.1% of all data pages on
large tables, up to the limit of maintenance_work_mem.
Note that the results obtained in the cited paper were obtained
from samples of 5 and 10%. It should also warrant caution
that the authors don't offer any proofs of confidence bounds,
even for the "average" case.
[...]
After some more experimentation, I'm wondering about some
sort of adaptive algorithm, a bit along the lines suggested
by Marko Ristola, but limited to 2 rounds.
One path might be to use the published algorithm and simply
recompute the statistics after every K blocks are sampled,
where K is a reasonably small number. If it looks like the
statistics are converging on a value, then take a few more
samples, check against the trend value and quit. Otherwise
continue until some artificial limit is reached.
The idea would be that we take a sample (either of fixed
size, or some small proportion of the table), see how well
it fits a larger sample (say a few times the size of the
first sample), and then adjust the formula accordingly to
project from the larger sample the estimate for the full
population. Math not worked out yet - I think we want to
ensure that the result remains bounded by [d,N].
The crudest algorithm could be something like the Newton-
Raphson method for finding roots. Just adjust the predicted
value up or down until it comes within an error tolerance of
the observed value for the current sample. No need to choose
powers of 2, and I would argue that simply checking every so
often on the way to a large sample that can be terminated
early is more efficient than sampling and resampling. Of
course, the crude algorithm would almost certainly be I/O
bound, so if a more sophisticated algorithm would give a
better prediction by spending a few more CPU cycles on each
sample block gathered, then that seems like a worthwhile
avenue to pursue.
As far as configuration goes, the user is most likely to
care about how long it takes to gather the statistics or
how accurate they are. So it would probably be best to
terminate the sampling process on a user-defined percentage
of the table size and the minimum error tolerance of the
algorithmic prediction value vs. the computed sample value.
If someone wants a fast and dirty statistic, they set the
row percent low and the error tolerance high, which will
effectively make the blocks read the limiting factor. If
they want an accurate statistic, they set the row percent
as high as they feel they can afford, and the error
tolerance as low as they need to in order to get the query
plans they want.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not. Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.
Perhaps the formula is not actually being applied?
The code looks like this...
if (nmultiple == 0)
{
/* If we found no repeated values, assume it's a unique column */
stats->stadistinct = -1.0;
}
else if (toowide_cnt == 0 && nmultiple == ndistinct)
{
/*
* Every value in the sample appeared more than once. Assume
* the column has just these values.
*/
stats->stadistinct = ndistinct;
}
else
{
/*----------
* Estimate the number of distinct values using the estimator
* proposed by Haas and Stokes in IBM Research Report RJ 10025:
The middle chunk of code looks to me like it says that if we find a
distribution where all values occur at least twice, then we won't bother to apply the
Haas and Stokes equation. That type of frequency distribution would be
very common in a set of values with very high ndistinct, especially when
sampled.
The comment
* Every value in the sample appeared more than once. Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.
Looking at Josh's application it does seem likely that when taking a
sample, all site visitors clicked more than once during their session,
especially if they include home page, adverts, images etc for each page.
Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?
Best Regards, Simon Riggs
Simon,
Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?
That's probably part of it, but I've tried Haas and Stokes on a pure random
sample and it's still bad, or more specifically overly conservative.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
Assuming you don't use the HashAgg plan, which seems very appropriate
for the task? (...but I understand the plan otherwise).
The context here is a case with a very large number of distinct
values...
Yes, but is there another way of doing this other than sampling a larger
proportion of the table? I don't like that answer either, for the
reasons you give.
The manual doesn't actually say this, but you can already alter the
sample size by setting one of the statistics targets higher; all of
those samples are fixed sample sizes, though, not a proportion of the table
itself. It seems reasonable to allow an option to scan a higher
proportion of the table. (It would be even better if you could say "keep
going until you run out of memory, then stop", to avoid needing to have
an external sort mode added to ANALYZE).
Oracle and DB2 allow a proportion of the table to be specified as a
sample size during statistics collection. IBM seem to be ignoring their
own research note on estimating ndistinct...
keep in mind also that we have to do this for *all* the
columns of the table.
You can collect stats for individual columns. You need only use an
option to increase sample size when required.
Also, if you have a large table and the performance of ANALYZE worries
you, set the statistics target for some columns to 0. Perhaps that should be the default setting
for very long text columns, since analyzing those doesn't help much
(usually) and takes ages. (I'm aware we already don't analyze var length
column values > 1024 bytes).
A full-table scan for each column seems
right out to me.
Some systems analyze multiple columns simultaneously.
Best Regards, Simon Riggs
Simon Riggs wrote:
The comment
* Every value in the sample appeared more than once. Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.
Looking at Josh's application it does seem likely that when taking a
sample, all site visitors clicked more than once during their session,
especially if they include home page, adverts, images etc for each page.
Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?
No, it is being applied. If every value in the sample appears more than
once, then f1 in the formula is 0, and the result is then just d, the
number of distinct values in the sample.
cheers
andrew
Hi everybody!
Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):
1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:
"Effective use of block-level sampling in statistics estimation"
by Chaudhuri, Das and Srivastava, SIGMOD 2004.
http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:
"Distinct sampling for highly-accurate answers to distinct value
queries and event reports" by Gibbons, VLDB 2001.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
3. In fact, Gibbons' basic idea has been extended to "sliding windows"
(this extension is useful in streaming systems like Aurora / Stream):
"Distributed streams algorithms for sliding windows"
by Gibbons and Tirthapura, SPAA 2002.
http://home.eng.iastate.edu/~snt/research/tocs.pdf
Thanks,
Gurmeet
----------------------------------------------------
Gurmeet Singh Manku Google Inc.
http://www.cs.stanford.edu/~manku (650) 967 1890
----------------------------------------------------
-----Original Message-----
From: Gurmeet Manku [mailto:manku@CS.Stanford.EDU]
Sent: Tuesday, April 26, 2005 5:01 PM
To: Simon Riggs
Cc: Tom Lane; josh@agliodbs.com; Greg Stark; Marko Ristola;
pgsql-perform; pgsql-hackers@postgresql.org; Utkarsh Srivastava;
snt@iastate.edu
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:"Distinct sampling for highly-accurate answers to distinct value
queries and event reports" by Gibbons, VLDB 2001.http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
[...]
This paper looks the most promising, and isn't too different
from what I suggested about collecting stats over the whole table
continuously. What Gibbons does is give a hard upper bound on
the sample size by using a logarithmic technique for storing
sample information. His technique appears to offer very good
error bounds and confidence intervals as shown by tests on
synthetic and real data. I think it deserves a hard look from
people hacking the estimator.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
This one looks *really* good.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
It does require a single full table scan, but it works in O(n) time and
constant space, and it guarantees confidence intervals for the estimates it
provides, much as the histograms do for regular range scans.
It can even keep enough data to provide estimates for n_distinct when
unrelated predicates are applied. I'm not sure Postgres would want to do this
though; this seems like it's part of the cross-column correlation story more
than the n_distinct story. It seems to require keeping an entire copy of the
sampled record in the stats tables which would be prohibitive quickly in wide
tables (it would be O(n^2) storage in the number of columns) .
It also seems like a lot of work to implement. Nothing particular that would
be impossible, but it does require storing a moderately complex data
structure. Perhaps Postgres's new support for data structures will make this
easier.
--
greg
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
This one looks *really* good.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
It does require a single full table scan
Ack.. Not by default please.
I have a few large append-only tables (vacuum isn't necessary) which do
need stats rebuilt periodically.
Let's just say that we've been working hard to upgrade to 8.0 primarily
because pg_dump was taking over 18 hours to make a backup.
--
Rod Taylor <rbt@sitesell.com>
Rod Taylor <rbt@sitesell.com> writes:
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
This one looks *really* good.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
It does require a single full table scan
Ack.. Not by default please.
I have a few large append-only tables (vacuum isn't necessary) which do
need stats rebuilt periodically.
The algorithm can also naturally be implemented incrementally. Which would be
nice for your append-only tables. But that's not Postgres's current philosophy
with statistics. Perhaps some trigger function that you could install yourself
to update statistics for a newly inserted record would be useful.
The paper is pretty straightforward and easy to read, but here's an executive
summary:
The goal is to gather a uniform sample of *distinct values* in the table as
opposed to a sample of records.
Instead of using a fixed percentage sampling rate for each record, use a hash
of the value to determine whether to include it. At first include everything,
but if the sample space overflows throw out half the values based on their
hash value. Repeat until finished.
In the end you'll have a sample of 1/2^n of your distinct values from your
entire data set, where n is large enough for your sample to fit in your
predetermined constant sample space.
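A toy version of that loop (Python; the salt, the 32-bit hash width and the
10,000-value cap are arbitrary choices for illustration, and the real
algorithm also keeps per-value information that this sketch drops):

def distinct_sample_estimate(values, max_sample=10000):
    """Keep a value only when its hash falls in the current 1/2^level slice
    of hash space; on overflow, shrink the slice by half and prune. The
    survivors are a uniform sample of *distinct values*, so the estimate is
    len(sample) * 2**level."""
    level = 0
    sample = set()

    def keep(value, level):
        h = hash(("ndistinct-salt", value)) & 0xFFFFFFFF
        return h < (0x100000000 >> level)

    for v in values:
        if keep(v, level):
            sample.add(v)
            while len(sample) > max_sample:   # overflow: throw out half the values
                level += 1
                sample = {x for x in sample if keep(x, level)}

    return len(sample) * (2 ** level)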
--
greg
On Tue, 2005-04-26 at 19:28 -0400, Greg Stark wrote:
Rod Taylor <rbt@sitesell.com> writes:
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
This one looks *really* good.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
It does require a single full table scan
Ack.. Not by default please.
I have a few large append-only tables (vacuum isn't necessary) which do
need stats rebuilt periodically.
The algorithm can also naturally be implemented incrementally. Which would be
nice for your append-only tables. But that's not Postgres's current philosophy
with statistics. Perhaps some trigger function that you could install yourself
to update statistics for a newly inserted record would be useful.
If when we have partitions, that'll be good enough. If partitions aren't
available this would be quite painful to anyone with large tables --
much as the days of old used to be painful for ANALYZE.
--
Rod Taylor <pg@rbt.ca> writes:
If when we have partitions, that'll be good enough. If partitions aren't
available this would be quite painful to anyone with large tables --
much as the days of old used to be painful for ANALYZE.
Yeah ... I am very un-enthused about these suggestions to make ANALYZE
go back to doing a full scan ...
regards, tom lane
Quoting Andrew Dunstan <andrew@dunslane.net>:
After some more experimentation, I'm wondering about some sort of
adaptive algorithm, a bit along the lines suggested by Marko
Ristola, but limited to 2 rounds.
The idea would be that we take a sample (either of fixed size, or
some small proportion of the table) , see how well it fits a larger
sample
(say a few times the size of the first sample), and then adjust
the formula accordingly to project from the larger sample the
estimate for the full population. Math not worked out yet - I think we
want to ensure that the result remains bounded by [d,N].
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than just
taking the larger sample on its own; GIGO. Whether they are consistent
with one another has no relationship to whether the larger sample
correlates with the whole population. You can think of the tiny sample
like "anecdotal" evidence for wonderdrugs.
--
"Dreams come true, not free." -- S.Sondheim, ITW
Tom Lane <tgl@sss.pgh.pa.us> writes:
Rod Taylor <pg@rbt.ca> writes:
If when we have partitions, that'll be good enough. If partitions aren't
available this would be quite painful to anyone with large tables --
much as the days of old used to be painful for ANALYZE.Yeah ... I am very un-enthused about these suggestions to make ANALYZE
go back to doing a full scan ...
Well one option would be to sample only a small number of records, but add the
data found from those records to the existing statistics. This would make
sense for a steady-state situation, but make it hard to recover from a drastic
change in data distribution. I think in the case of n_distinct it would also
bias the results towards underestimating n_distinct but perhaps that could be
corrected for.
But I'm unclear for what situation this is a concern.
For most use cases users have to run vacuum occasionally. In those cases
"vacuum analyze" would be no worse than a straight normal vacuum. Note that
this algorithm doesn't require storing more data because of the large scan or
performing large sorts per column. It's purely O(n) time and O(1) space.
On the other hand, if you have tables you aren't vacuuming that means you
perform zero updates or deletes. In which case some sort of incremental
statistics updating would be a good solution. A better solution even than
sampling.
--
greg
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote:
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:"Distinct sampling for highly-accurate answers to distinct value
queries and event reports" by Gibbons, VLDB 2001.http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
That looks like the one...
...though it looks like some more complex changes to the current
algorithm to use it, and we want the other stats as well...
3. In fact, Gibbons' basic idea has been extended to "sliding windows"
(this extension is useful in streaming systems like Aurora / Stream):
"Distributed streams algorithms for sliding windows"
by Gibbons and Tirthapura, SPAA 2002.
...and this offers the possibility of calculating statistics at load
time, as part of the COPY command.
Best Regards, Simon Riggs
Mischa Sandberg wrote:
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than just
taking the larger sample on its own; GIGO. Whether they are consistent
with one another has no relationship to whether the larger sample
correlates with the whole population. You can think of the tiny sample
like "anecdotal" evidence for wonderdrugs.
Ok, good point.
I'm with Tom though in being very wary of solutions that require even
one-off whole table scans. Maybe we need an additional per-table
statistics setting which could specify the sample size, either as an
absolute number or as a percentage of the table. It certainly seems that
where D/N ~ 0.3, the estimates on very large tables at least are way way
out.
Or maybe we need to support more than one estimation method.
Or both ;-)
cheers
andrew
-----Original Message-----
From: Greg Stark [mailto:gsstark@mit.edu]
Sent: Wednesday, April 27, 2005 1:00 AM
To: Tom Lane
Cc: Rod Taylor; Greg Stark; pgsql-hackers@postgresql.org;
Gurmeet Manku
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
Tom Lane <tgl@sss.pgh.pa.us> writes:
Rod Taylor <pg@rbt.ca> writes:
If when we have partitions, that'll be good enough. If
partitions aren't available this would be quite painful
to anyone with large tables -- much as the days of old
used to be painful for ANALYZE.
Yeah ... I am very un-enthused about these suggestions to
make ANALYZE go back to doing a full scan ...
I don't see why ANALYZE would always have to do a full scan.
Clearly, this statistic would only be useful to people who need
a very accurate n_distinct on tables where the current metric
does not work very well. Applying a specialized solution to
every table doesn't seem like an efficient way to go about
things. Instead, the distinct sampling mechanism should be
purely optional, and probably purely separate from the vanilla
ANALYZE mechanism, because it works differently. If it were
designed that way, the full table scan would be a one-time
cost that would not even need to be paid if the user turned
on this mechanism at table creation. Thereafter, the statistic
would need to be updated incrementally, but that just
distributes the cost of ANALYZE over the INSERT/UPDATE/DELETEs.
Obviously, it's a higher cost because you touch every record
that hits the table, but that's the price you pay for a good
n_distinct.
The block estimator should probably become the default, since
it works within the current ANALYZE paradigm of sampling the
data.
[...]
For most use cases users have to run vacuum occasionally. In
those cases "vacuum analyze" would be no worse than a straight
normal vacuum.
And that's only if you do a full table scan every time. In the
incremental implementation, there are no lump sum costs involved
except when the statistic is first initialized.
Note that this algorithm doesn't require storing more data
because of the large scan or performing large sorts per
column. It's purely O(n) time and O(1) space.
And I think it should be emphasized that distinct sampling not
only gives you a good n_distinct for query planning, it also
gives you a very fast approximate answer for related aggregate
queries. So you're getting more than just query tuning for that
cost.
On the other hand, if you have tables you aren't vacuuming
that means you perform zero updates or deletes. In which case
some sort of incremental statistics updating would be a good
solution. A better solution even than sampling.
Which, for the large data warehousing situations where it seems
this mechanism would be most useful, this would probably be the
most common case.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
Mischa,
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than just
taking the larger sample on its own; GIGO. Whether they are consistent
with one another has no relationship to whether the larger sample
correlates with the whole population. You can think of the tiny sample
like "anecdotal" evidence for wonderdrugs.
Actually, it's more to characterize how large of a sample we need. For
example, if we sample 0.005 of disk pages, and get an estimate, and then
sample another 0.005 of disk pages and get an estimate which is not even
close to the first estimate, then we have an idea that this is a table which
defies analysis based on small samples. Whereas if the two estimates are <
1.0 stdev apart, we can have good confidence that the table is easily
estimated. Note that this doesn't require progressively larger samples; any
two samples would work.
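As a sketch of that self-check (the names are hypothetical, estimate_ndistinct() stands in for whatever block-sample estimator ANALYZE uses, and the 2x disagreement cutoff is arbitrary):

def defies_small_samples(sample_pages, estimate_ndistinct, frac=0.005):
    # Draw two independent cheap samples and compare their estimates.
    est1 = estimate_ndistinct(sample_pages(frac))
    est2 = estimate_ndistinct(sample_pages(frac))
    lo, hi = min(est1, est2), max(est1, est2)
    # If the two estimates disagree badly, small samples can't be trusted
    # for this table and a larger sample (or another method) is needed.
    return lo == 0 or hi / lo > 2.0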
I'm with Tom though in being very wary of solutions that require even
one-off whole table scans. Maybe we need an additional per-table
statistics setting which could specify the sample size, either as an
absolute number or as a percentage of the table. It certainly seems that
where D/N ~ 0.3, the estimates on very large tables at least are way way
out.
Oh, I think there are several other cases where estimates are way out.
Basically the estimation method we have doesn't work for samples smaller than
0.10.
Or maybe we need to support more than one estimation method.
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated but we can't afford to sample a
large % of pages.
If we're doing sampling-based estimation, I really don't want people to lose
sight of the fact that page-based random sampling is much less expensive than
row-based random sampling. We should really be focusing on methods which
are page-based.
--
Josh Berkus
Aglio Database Solutions
San Francisco
-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Wednesday, April 27, 2005 10:25 AM
To: Andrew Dunstan
Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
Actually, it's more to characterize how large of a sample
we need. For example, if we sample 0.005 of disk pages, and
get an estimate, and then sample another 0.005 of disk pages
and get an estimate which is not even close to the first
estimate, then we have an idea that this is a table which
defies analysis based on small samples.
I buy that.
Whereas if the two estimates are < 1.0 stdev apart, we can
have good confidence that the table is easily estimated.
I don't buy that. A negative indication is nothing more than
proof by contradiction. A positive indication is mathematical
induction over the set, which in this type of context is
logically unsound. There is no reason to believe that two
small samples with a small difference imply that a table is
easily estimated rather than that you got unlucky in your
samples.
[...]
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages
(say, >= 0.1)
1 for tables where we sample a small % of pages but are
"easily estimated"
1 for tables which are not easily estimated but we can't
afford to sample a large % of pages.
I don't buy that the first and second need to be different
estimation methods. I think you can use the same block
sample estimator for both, and simply stop sampling at
different points. If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases. However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.
Of course, I would recommend the distinct sample technique
for the third case.
If we're doing sampling-based estimation, I really don't
want people to lose sight of the fact that page-based random
sampling is much less expensive than row-based random
sampling. We should really be focusing on methods which
are page-based.
Of course, that savings comes at the expense of having to
account for factors like clustering within blocks. So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
"Dave Held" <dave.held@arraysg.com> writes:
Actually, it's more to characterize how large of a sample
we need. For example, if we sample 0.005 of disk pages, and
get an estimate, and then sample another 0.005 of disk pages
and get an estimate which is not even close to the first
estimate, then we have an idea that this is a table which
defies analysis based on small samples.
I buy that.
Better yet is to use the entire sample you've gathered of .01 and then perform
analysis on that sample to see what the confidence interval is. Which is
effectively the same as what you're proposing except looking at every possible
partition.
Unfortunately the reality according to the papers that were sent earlier is
that you will always find the results disappointing. Until your sample is
nearly the entire table your estimates for n_distinct will be extremely
unreliable.
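One generic way to do that analysis on the sample already in hand is to bootstrap the estimator over it; this is only an illustration, not something proposed in the thread, and as the paragraph above warns, the interval for n_distinct will usually come out discouragingly wide:

import random

def bootstrap_interval(sample, estimator, rounds=200, alpha=0.05):
    # Re-run the estimator on resamples of the gathered sample to gauge its spread.
    estimates = sorted(estimator([random.choice(sample) for _ in sample])
                       for _ in range(rounds))
    lo = estimates[int(rounds * alpha / 2)]
    hi = estimates[int(rounds * (1 - alpha / 2)) - 1]
    return lo, hi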
--
greg
Quoting Josh Berkus <josh@agliodbs.com>:
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than just
taking the larger sample on its own; GIGO. Whether they are consistent
with one another has no relationship to whether the larger sample
correlates with the whole population. You can think of the tiny sample
like "anecdotal" evidence for wonderdrugs.Actually, it's more to characterize how large of a sample we need. For
example, if we sample 0.005 of disk pages, and get an estimate, and then
sample another 0.005 of disk pages and get an estimate which is not even
close to the first estimate, then we have an idea that this is a table
which
defies analysis based on small samples. Wheras if the two estimates
are <
1.0 stdev apart, we can have good confidence that the table is easily
estimated. Note that this doesn't require progressively larger
samples; any
two samples would work.
We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider has a background in
stats, and have them draw for you why this doesn't work?
About all you can get out of it is this: if the two samples are
more than a stddev apart, yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.
--
"Dreams come true, not free." -- S.Sondheim, ITW
First I will comment on my original idea.
Second I will give another, improved suggestion.
I hope that they will be useful for you.
(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )
I haven't looked at the PostgreSQL code, so I don't know what is possible
now and what is not. I do know, even without looking at the code, that a
full table scan followed by incremental statistics updates would be a very
big change.
I meant the following idea:
- compare two equal-sized samples, then redo the same thing with samples of
double the size (admittedly doing lots of redundant work), and check the
correlation of the two samples to try to guess the distribution.
So I tried to give you an idea, not a full answer to the whole problem.
I read some parts of the attached PDFs. They convinced me that the
heuristics for the hard cases would actually have to read almost the whole
table in many cases.
I tried to cover the "too small sample" problem by saying that the user
should be able to set a minimum sample size; that way you would avoid it.
My purpose was not to achieve estimates that are at most 5% wrong, but to
reduce the 2000% wrong estimates that are sometimes seen now.
Conclusions:
- No heuristics over small samples will guarantee excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys, unique indexes and certain column
types, do give easy ways to make estimates. For example, whether a boolean
column has zero, one or two distinct values does not matter so much;
hashing seems the right choice for all of them.
If I have understood correctly, full table scans are out of the question
for large tables at this time.
The idea of taking a 10% sample seems good.
So here is another suggestion:
1. Do a partial scan of that percentage of the table, starting at an
arbitrary position. If the user's data is not homogeneous this hurts
accuracy, but it is faster this way. During that scan, try to figure out
all those columns that have at most 100 distinct values. Of course, you
can't reach 100% accuracy this way, but if a full table scan is out of the
question, it is better if the estimate is, for example, at most ten times
wrong.
You could also improve accuracy by doing 20 separate 0.5 percent partial
table scans instead of a single 10% partial scan: this would improve
accuracy a bit but keep the speed almost the same as the single partial
scan.
Here are questions for the statisticians about the distinct values
calculation:
If we want at most 1000% tolerance, how big a percentage of one column of
the table must be processed?
If we want at most 500% tolerance, how big a percentage must be processed?
If we want at most 250% tolerance, how big a percentage must be processed?
It is fine to assume that there are at most 100 distinct values in the
column, if that helps the calculations.
If we try to get as much as possible from one discontinuous partial table
scan (a 0.1-10% sample), here is the information we can gather:
1. We could gather a histogram of at most 100 distinct values for every
column.
2. We could measure the variance, the average, and the number of rows for
those 100 distinct values.
3. We could count the number of rows that didn't match those 100 distinct
values and were therefore left out of the histogram.
4. We could get a minimum and a maximum value for each column.
=> We could get exact information about the sample for many columns with
one 0.1-10% pass.
What can you statisticians derive from these values?
My idea combines the programmatic approach with statistics:
+ Performance: scan, for example, 100 blocks each of size 100 MB, because
disk I/O is much faster this way.
+ Enables a larger table percentage. I hope it helps with the statistics
formula.
+ Required for more robust statistics: take those blocks at random
(non-overlapping) places to reduce the effect of hitting statistically
bad parts of the table.
+ Fewer table scan passes: scan all columns with limited hashing in the
first pass (see the sketch after this list).
+ All easy columns are handled in this one pass.
+- Harder columns each need a pass of their own, but we at least have some
preliminary knowledge of them from the sample (the minimum and maximum
values and the histogram of the 100 distinct values).
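To make the first-pass bookkeeping concrete, here is a small Python sketch of the bounded per-column accumulator described above (the names are mine, not from the thread); a column whose overflow count stays at zero is one of the "easy" columns, while the rest would need the extra per-column passes:

class ColumnAccumulator:
    def __init__(self, limit=100):
        self.limit = limit
        self.counts = {}         # value -> exact row count, while space allows
        self.overflow_rows = 0   # rows whose value no longer fit in the histogram
        self.min_val = None
        self.max_val = None

    def add(self, value):
        if self.min_val is None or value < self.min_val:
            self.min_val = value
        if self.max_val is None or value > self.max_val:
            self.max_val = value
        if value in self.counts:
            self.counts[value] += 1
        elif len(self.counts) < self.limit:
            self.counts[value] = 1
        else:
            self.overflow_rows += 1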
Marko Ristola
Greg Stark wrote:
"Dave Held" <dave.held@arraysg.com> writes:
Actually, it's more to characterize how large of a sample
we need. For example, if we sample 0.005 of disk pages, and
get an estimate, and then sample another 0.005 of disk pages
and get an estimate which is not even close to the first
estimate, then we have an idea that this is a table which
defies analysis based on small samples.
I buy that.
Better yet is to use the entire sample you've gathered of .01 and then perform
analysis on that sample to see what the confidence interval is. Which is
effectively the same as what you're proposing except looking at every possible
partition.
Unfortunately the reality according to the papers that were sent earlier is
that you will always find the results disappointing. Until your sample is
nearly the entire table your estimates for n_distinct will be extremely
unreliable.
Well, this guy has it nailed. He cites Flajolet and Martin, which was (I
thought) as good as you could get with only a reasonable amount of memory per
statistic. Unfortunately, their hash table is a one-shot deal; there's no way
to maintain it once the table changes. His incremental update doesn't degrade
as the table changes. If there isn't the same wrangle of patent as with the
ARC algorithm, and if the existing stats collector process can stand the extra
traffic, then this one is a winner.
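For reference, the Flajolet-Martin scheme mentioned here fits in a few lines of Python; this bare single-bitmap version (no averaging over many hash functions) is very noisy and is purely illustrative, and, as noted above, there is no way to back a value out of the bitmap once the table changes:

import hashlib

PHI = 0.77351   # Flajolet-Martin bias-correction constant

def _trailing_zeros(value):
    h = int(hashlib.md5(repr(value).encode()).hexdigest(), 16)
    return (h & -h).bit_length() - 1 if h else 128

class FMCounter:
    def __init__(self):
        self.bitmap = 0

    def add(self, value):
        self.bitmap |= 1 << _trailing_zeros(value)

    def n_distinct(self):
        r = 0
        while self.bitmap & (1 << r):   # find the lowest bit position never set
            r += 1
        return (2 ** r) / PHI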
Many thanks to the person who posted this reference in the first place; so
sorry I canned your posting and can't recall your name.
Now, if we can come up with something better than the ARC algorithm ...
Now, if we can come up with something better than the ARC algorithm ...
Tom already did. His clock-sweep patch is already in the 8.1 source.
--
Josh Berkus
Aglio Database Solutions
San Francisco
Actually, the earliest paper that solves the distinct_n estimation
problem in 1 pass is the following:
"Estimating simple functions on the union of data streams"
by Gibbons and Tirthapura, SPAA 2001.
http://home.eng.iastate.edu/~snt/research/streaming.pdf
The above paper addresses a more difficult problem (1 pass
_and_ a distributed setting).
Gibbons' followup paper in VLDB 2001 limits the problem to a
single machine and contains primarily experimental results (for
a database audience). The algorithmic breakthrough had already been
accomplished in the SPAA paper.
Gurmeet
--
----------------------------------------------------
Gurmeet Singh Manku Google Inc.
http://www.cs.stanford.edu/~manku (650) 967 1890
----------------------------------------------------
Hi, Josh,
Josh Berkus wrote:
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated but we can't afford to sample a
large % of pages.
If we're doing sampling-based estimation, I really don't want people to lose
sight of the fact that page-based random sampling is much less expensive than
row-based random sampling. We should really be focusing on methods which
are page-based.
Would it make sense to have a sample method that scans indices? I think
that, at least for tree based indices (btree, gist), rather good
estimates could be derived.
And the presence of a unique index should lead to 100% distinct values
estimation without any scan at all.
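The unique-index case at least is cheap to express, since pg_statistic already encodes "distinct count scales with row count" as a negative stadistinct and -1.0 means every row is distinct. A purely illustrative sketch, where has_unique_index is a hypothetical flag and the 10% cutoff only roughly mirrors what ANALYZE does today:

def stadistinct_for(has_unique_index, sampled_estimate, reltuples):
    if has_unique_index:
        return -1.0   # every row distinct, no scan needed
    if reltuples > 0 and sampled_estimate > 0.1 * reltuples:
        # Store as a (negative) fraction so it scales as the table grows.
        return -sampled_estimate / reltuples
    return float(sampled_estimate)   # store as an absolute count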
Markus
Hello Everybody,
We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
foreign key constraints compile noticeably faster. In 7.3 the
constraints would typically take more than an hour to run on our
production data. Now they take a minute or two.
Can anybody explain such a major performance improvement ?
Thanks
--
Ashish Arte
Open Sky Software
Ashish Arte <ashish@openskysoftware.com> writes:
We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
foreign key constraints compile noticeably faster. In 7.3 the
constraints would typically take more than an hour to run on our
production data. Now they take a minute or two.
Can anybody explain such a major performance improvement ?
Hey, we do do some work on this thing from time to time ;-)
Probably you are talking about this:
2003-10-06 12:38 tgl
* src/: backend/commands/tablecmds.c,
backend/utils/adt/ri_triggers.c, include/commands/trigger.h: During
ALTER TABLE ADD FOREIGN KEY, try to check the existing rows using a
single LEFT JOIN query instead of firing the check trigger for each
row individually. Stephan Szabo, with some kibitzing from Tom Lane
and Jan Wieck.
regards, tom lane
Quoting Markus Schaber <schabi@logix-tt.com>:
Hi, Josh,
Josh Berkus wrote:
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated but we can't afford to
sample a
large % of pages.
If we're doing sampling-based estimation, I really don't want
people to lose
sight of the fact that page-based random sampling is much less
expensive than
row-based random sampling. We should really be focusing on
methods which
are page-based.
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
How about applying the distinct-sampling filter on a small extra data
stream to the stats collector?
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that to do, for example,
10% of rows purely randomly would actually mean loading 50% of pages. With
20% of rows, you might as well scan the whole table.
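The arithmetic behind that claim, under the simplifying assumption of independent row sampling: with a sampling fraction f and r rows per page, a page is skipped only when all r of its rows are, so the expected fraction of pages read is 1 - (1 - f)^r. A quick check in Python (the rows-per-page figures are just examples):

def pages_touched(f, rows_per_page):
    # Probability that a given page contains at least one sampled row.
    return 1 - (1 - f) ** rows_per_page

for f in (0.10, 0.20):
    for r in (7, 20, 50):
        print("f=%.2f  rows/page=%2d  pages read=%.0f%%"
              % (f, r, 100 * pages_touched(f, r)))
# f=0.10 reads ~52% of pages at 7 rows/page and ~88-99% at 20-50 rows/page;
# f=0.20 is close to a full scan for anything but very wide rows.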
Unless, of course, we use indexes for sampling, which seems like a *really
good* idea to me ....
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that to do, for example,
10% of rows purely randomly would actually mean loading 50% of pages. With
20% of rows, you might as well scan the whole table.
Unless, of course, we use indexes for sampling, which seems like a *really
good* idea to me ....
But doesn't an index only sample one column at a time, whereas with
page-based sampling, you can sample all of the columns at once. And not
all columns would have indexes, though it could be assumed that if a
column doesn't have an index, then it doesn't matter as much for
calculations such as n_distinct.
But if you had 5 indexed columns in your table, then doing it index-wise
means you would have to make 5 passes instead of just one.
Though I agree that page-based sampling is important for performance
reasons.
John
=:->
John,
But doesn't an index only sample one column at a time, whereas with
page-based sampling, you can sample all of the columns at once.
Hmmm. Yeah, we're not currently doing that though. Another good idea ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Quoting Josh Berkus <josh@agliodbs.com>:
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
Bad analogy? Page-sampling effort versus row-sampling effort, c'est
moot. It's not good enough for stats to produce good behaviour on the
average. Straight random sampling, page or row, is going to cause
enough untrustworthy engine behaviour, for any %ages small enough to
allow sampling from scratch at any time.
I'm curious what the problem is with relying on a start-up plus
incremental method, when the method in the distinct-sampling paper
doesn't degenerate: you can start when the table is still empty.
Constructing an index requires an initial full scan plus incremental
update; what's the diff?
Unless, of course, we use indexes for sampling, which seems like a
*really
good* idea to me ....
"distinct-sampling" applies for indexes, too. I started tracking the
discussion of this a bit late. Smart method for this is in VLDB'92:
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I
don't think this is online anywhere, except if you have a DBLP
membership. Does anybody else know better?
Antoshenkov was the brains behind some of the really cool stuff in DEC
Rdb (what eventually became Oracle). Compressed bitmap indices,
parallel competing query plans, and smart handling of keys with
hyperbolic distributions.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.