Re: analyze.c
Hi!
About analyze.c:
If it were taken out of vacuum, couldn't it be taken out of pg completely? Say,
into an external program? What's the big reason not to do that? I know that
there is some code in analyze.c (like the comparison code) that uses other parts of
pg, but that seems to be easily fixed.
I'm leaning toward the implementation of end-biased histograms. There is
an introductory reference in the IEEE Data Engineering Bulletin, September
1995 (available on the Microsoft Research site).
Why take it out of the backend? Seems like a real pain, especially when
you realize what functions it would have to call.
Also, keep in mind that the current analyze generates perfect estimates for
columns containing only two unique values, and columns containing only
unique values. All other cases generate imperfect statistics.
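One way to see why those two cases come out exact is sketched below. It is not the code in analyze.c. It assumes, purely for illustration, that the only per-column statistics kept are the most common value, its fraction of the rows, and the number of distinct values; the names `column_stats` and `estimate_selectivity` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def column_stats(rows):
    """Return (most_common_value, its fraction of rows, number of distinct values).

    Assumption for this sketch: these three numbers are all we store.
    """
    counts = Counter(rows)
    value, count = counts.most_common(1)[0]
    return value, Fraction(count, len(rows)), len(counts)


def estimate_selectivity(value, most_common_value, most_common_frac, n_distinct):
    """Estimate the fraction of rows equal to `value` from the stored stats."""
    if value == most_common_value:
        return most_common_frac
    if n_distinct <= 1:
        # Only one distinct value exists, so nothing else can match.
        return Fraction(0)
    # Spread the remaining rows evenly over the remaining distinct values.
    return (1 - most_common_frac) / (n_distinct - 1)
```

With two distinct values, the second value gets exactly the rows left over by the first, and with all-unique values every value gets exactly 1/N. With three values split 70/20/10, the two rarer values are both estimated at 15%, which is the kind of imperfect statistic described above.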
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Import Notes
Reply to msg id not found: Pine.LNX.4.21.0008231217570.3985-100000@eros.si.fct.unl.pt (from Tiago Antão, Aug 23, 2000, 12:18:19 pm)
BTW, you can get access to the SIGMOD CDs with lots of goodies for a very low
price (at least in 1999 it was a bargain); check out ACM membership for
SIGMOD.
I've been reading something about the implementation of histograms, and,
AFAIK, in practice "histogram" is just a cool name for no more than:
1. the top ten values, with the frequency for each
2. the same for the bottom ten (the least frequent values)
3. an average for the rest
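The three points above can be sketched roughly as follows. This is an illustration of the idea only, not anything from analyze.c; `build_end_biased`, `estimate_count`, and the parameter `k` are made-up names, and the sketch works on an in-memory list rather than a table.

```python
from collections import Counter


def build_end_biased(values, k=10):
    """Keep exact counts for the k most and k least frequent values,
    and a single average count for everything in between."""
    ranked = Counter(values).most_common()  # (value, count), most frequent first
    top = dict(ranked[:k])
    # Least frequent values that are not already in the top group.
    bottom = dict(ranked[k:][-k:])
    rest = [count for value, count in ranked
            if value not in top and value not in bottom]
    avg_rest = sum(rest) / len(rest) if rest else 0.0
    return top, bottom, avg_rest


def estimate_count(value, top, bottom, avg_rest):
    """Estimated number of rows equal to `value`.

    Limitation of this sketch: a value that never occurs in the column
    is also answered with the average of the middle group.
    """
    if value in top:
        return top[value]
    if value in bottom:
        return bottom[value]
    return avg_rest
```

Values at either end of the frequency ranking get their exact counts back; everything else gets the same average, which is where the estimation error is concentrated.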
I wonder if just increasing the number of buckets in analyze.c would
help?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Import Notes
Reply to msg id not found: Pine.LNX.4.21.0008231742420.5111-100000@eros.si.fct.unl.pt (from Tiago Antão, Aug 23, 2000, 06:22:40 pm) | Resolved by subject fallback
I've been reading something about the implementation of histograms, and,
AFAIK, in practice "histogram" is just a cool name for no more than:
1. the top ten values, with the frequency for each
2. the same for the bottom ten (the least frequent values)
3. an average for the rest
Consider that we only need that info for the choice of index. If an average value is too
frequent for the index to be efficient, you can safely drop the index; it would be useless.
Thus it seems to me that keeping stats on the most infrequent values (point 2) is useless.
For me these would also be the most volatile, so the stats would only be
accurate for a short period of time.
I think what we need is as follows:
1. our current histograms
2. a list of exceptions for exceptional values that are very frequent
Exceptional values are those that would skew the distribution too much.
Very infrequent values should not be used as the min|max values of histogram buckets,
but that is imho all that needs to be done for infrequent values.
Andreas
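The two-part proposal above could look roughly like this. It is a sketch under stated assumptions, not existing PostgreSQL code: it assumes the histogram is a list of equal-depth bucket boundaries, it treats a value as "exceptional" once it reaches a chosen fraction of the rows, and the names `build_stats`, `n_buckets`, and `exception_frac` are invented for the example. The rule about keeping very infrequent values out of the bucket boundaries is not implemented here.

```python
from collections import Counter


def build_stats(values, n_buckets=4, exception_frac=0.2):
    """Return (exceptions, bounds).

    exceptions: {value: fraction of rows} for very frequent values.
    bounds: n_buckets + 1 boundary values of an equal-depth histogram
            built over the rows that are NOT exceptions.
    """
    total = len(values)
    counts = Counter(values)
    # Values frequent enough to skew the distribution are listed separately.
    exceptions = {v: c / total for v, c in counts.items()
                  if c / total >= exception_frac}
    remaining = sorted(v for v in values if v not in exceptions)
    bounds = []
    if remaining:
        last = len(remaining) - 1
        for i in range(n_buckets + 1):
            # Pick boundaries so each bucket covers about the same number of rows.
            bounds.append(remaining[min(i * len(remaining) // n_buckets, last)])
    return exceptions, bounds
```

Pulling the very frequent values out first keeps them from collapsing several bucket boundaries onto the same value, so the histogram describes the rest of the distribution.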