Statistics and selectivity estimation for ranges
Hackers,
attached patch is for collecting statistics and selectivity estimation for
ranges.
In order to make our estimations accurate for every distribution of
ranges, we would need to collect the 2d-distribution of lower and upper
bounds of ranges into some kind of 2d-histogram. However, this patch uses a
simplification and assumes the distribution of lower bounds and the
distribution of lengths to be independent. We can get the distribution of
lower bounds from the standard scalar statistics, and this patch
additionally collects statistics on range length.
This patch includes selectivity estimations for the "&&", "@>" and "<@"
operators on ranges. Linear interpolation is used in order to get more
accurate results.
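To make the independence assumption concrete, here is a hedged Python sketch (illustrative only, not the patch's C code; `overlap_selectivity` and its arguments are made-up names). A range [lb, lb+len) overlaps a constant [a, b) iff lb < b and lb + len > a, so under the assumption the && selectivity splits into a lower-bound term and a length term:

```python
# Illustrative sketch, NOT the patch's implementation: estimate the
# selectivity of "r && [a,b)" assuming the lower bound and the length
# of r are independent (the simplification described above).
# [lb, lb+len) overlaps [a, b) iff lb < b and lb + len > a, hence
# (for a < b and nonnegative lengths):
#   P(overlap) = P(lb < b) - P(lb + len <= a)

def overlap_selectivity(lower_bounds, length_cdf, a, b):
    """lower_bounds: a sample standing in for the lower-bound statistics.
    length_cdf(x): fraction of ranges with length <= x (length statistics)."""
    n = len(lower_bounds)
    p_lb_lt_b = sum(1 for lb in lower_bounds if lb < b) / n
    # P(lb + len <= a) = average over lb of P(len <= a - lb), by independence
    p_ends_before_a = sum(length_cdf(a - lb) for lb in lower_bounds) / n
    return p_lb_lt_b - p_ends_before_a

# Toy check: lower bounds uniform over 0..99, every length exactly 10,
# so [lb, lb+10) overlaps [50, 60) iff 41 <= lb <= 59 (19 of 100 rows).
lbs = list(range(100))
cdf = lambda x: 1.0 if x >= 10 else 0.0
sel = overlap_selectivity(lbs, cdf, 50, 60)   # -> 0.19
```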
Some examples with a test dataset where the lower bound of the range
follows a Gaussian distribution and the length of the range follows an
exponential distribution:
test=# CREATE TABLE range_test as (SELECT int4range(lb, lb + len) AS r FROM
(SELECT (sqrt(-2*ln(random())) * sin(2*pi()*random()) * 1000000)::int as
lb, (-10000*ln(1.0 - random()) + 1)::int as len FROM
generate_series(1,1000000)) x);
SELECT 1000000
test=# ANALYZE range_test;
ANALYZE
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(700000,710000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=7535 width=14) (actual
time=0.119..403.494 rows=6138 loops=1)
Filter: (r && '[700000,710000)'::int4range)
Rows Removed by Filter: 993862
Total runtime: 403.945 ms
(4 rows)
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(200000,300000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*42427* width=14)
(actual time=0.100..401.079 rows=*42725* loops=1)
Filter: (r && '[200000,300000)'::int4range)
Rows Removed by Filter: 957275
Total runtime: 403.055 ms
(4 rows)
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(100000,150000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*15341* width=14)
(actual time=0.129..382.114 rows=*16014* loops=1)
Filter: (r <@ '[100000,150000)'::int4range)
Rows Removed by Filter: 983986
Total runtime: 382.985 ms
(4 rows)
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(600000,603000);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*122* width=14) (actual
time=1.527..383.511 rows=*127* loops=1)
Filter: (r <@ '[600000,603000)'::int4range)
Rows Removed by Filter: 999873
Total runtime: 383.586 ms
(4 rows)
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(100000,100400);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*5166* width=14) (actual
time=0.238..377.712 rows=*3909* loops=1)
Filter: (r @> '[100000,100400)'::int4range)
Rows Removed by Filter: 996091
Total runtime: 378.018 ms
(4 rows)
test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(500000,530000);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*342* width=14) (actual
time=11.796..382.986 rows=*171* loops=1)
Filter: (r @> '[500000,530000)'::int4range)
Rows Removed by Filter: 999829
Total runtime: 383.066 ms
(4 rows)
------
With best regards,
Alexander Korotkov.
On 04.08.2012 12:31, Alexander Korotkov wrote:
Hackers,
attached patch is for collecting statistics and selectivity estimation for
ranges.
In order to make our estimations accurate for every distribution of
ranges, we would collect 2d-distribution of lower and upper bounds of range
into some kind of 2d-histogram. However, this patch use some simplification
and assume distribution of lower bound and distribution of length to be
independent.
Sounds reasonable. Another possibility would be to calculate the average
length for each lower-bound bin. So you would e.g know the average
length of values with lower bound between 1-10, and the average length
of values with lower bound between 10-20, and so forth. Within a bin,
you would have to assume that the distribution of the lengths is fixed.
PS. get_position() should guard against division by zero, when subdiff
returns zero.
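The guard Heikki asks for could look like this hedged Python sketch (the patch's get_position is C; the names and the 0.5 fallback here are illustrative assumptions, not the actual fix):

```python
# Illustrative sketch of get_position-style interpolation: where does
# 'value' fall within a histogram bin? If subdiff is zero (a zero-width
# or degenerate bin), dividing would crash, so fall back to the bin
# middle. NOT the patch's C code; the 0.5 fallback is an assumption.

def get_position(value, bin_lower, bin_upper):
    subdiff = bin_upper - bin_lower
    if subdiff <= 0:
        return 0.5            # degenerate bin: assume the middle
    pos = (value - bin_lower) / subdiff
    return min(max(pos, 0.0), 1.0)   # clamp into [0, 1]
```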
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Having statistics on ranges was really missing! The planner was making
some really, really bad choices on bigger tables regarding seq/random
scans, nested loop vs. other joins, etc.
Is there any chance this makes it into 9.2 final? It would really
round-off the introduction of range types and maybe avoid problems
like "the new range types are slow" (just due to the bad row
estimates).
Thanks for implementing this feature,
-Matthias
Is there any chance this makes it into 9.2 final? It would really
round-off the introduction of range types and maybe avoid problems
like "the new range types are slow" (just due to the bad row
estimates).
Nope, that's strictly a 9.3 feature. 9.2 is in beta2.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
On 04.08.2012 12:31, Alexander Korotkov wrote:
Hackers,
attached patch is for collecting statistics and selectivity estimation for
ranges.In order to make our estimations accurate for every distribution of
ranges, we would collect 2d-distribution of lower and upper bounds of
range
into some kind of 2d-histogram. However, this patch use some
simplification
and assume distribution of lower bound and distribution of length to be
independent.
Sounds reasonable. Another possibility would be to calculate the average
length for each lower-bound bin. So you would e.g know the average length
of values with lower bound between 1-10, and the average length of values
with lower bound between 10-20, and so forth. Within a bin, you would have
to assume that the distribution of the lengths is fixed.
Interesting idea. AFAICS, if we store the average length for each
lower-bound bin, we still have to assume some kind of distribution of range
length in order to do estimates. For example, assume that range length has
an exponential distribution. Correspondingly, we have the following
trade-off: we don't have to assume the lower bound distribution to be
independent of the length distribution, but we do have to assume a kind of
length distribution. Actually, I don't know which is better.
Ideally, we would have a range length histogram for each lower-bound bin,
or an upper-bound histogram for each lower-bound bin. But storing that
amount of data seems too expensive.
------
With best regards,
Alexander Korotkov.
For testing statistics accuracy I've used the same datasets as for testing
opclass performance:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00414.php
Script for testing and database schema is attached.
Dump with tests results can be downloaded here:
http://test.i-gene.ru/uploads/range_stat_tests.sql.gz
The following table shows accuracy statistics when the actual count of
rows is somewhat large (>= 10). The second column shows the average ratio
of the estimated row count to the actual row count. The third column shows
the average relative error of the estimation.
range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.27166784340153 | 0.498570654434906
@> | 1.35965412121763 | 0.384991198200582
&& | 1.08236985243139 | 0.105298599354035
(3 rows)
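The two metrics in the tables can be restated in a short Python sketch (`accuracy_metrics` is a hypothetical helper, not part of the patch): avg_error is the mean of exp(|ln(estimate/actual)|) − 1, a symmetric relative error that penalizes over- and under-estimation by the same factor equally.

```python
import math

# Sketch of the accuracy metrics used in the tables above (illustrative
# helper, not part of the patch). avg_ratio: mean of estimate/actual.
# avg_error: mean of exp(|ln(estimate/actual)|) - 1.

def accuracy_metrics(pairs):
    """pairs: (estimate_count, actual_count) tuples with actual > 0."""
    ratios = [est / act for est, act in pairs]
    avg_ratio = sum(ratios) / len(ratios)
    avg_error = sum(math.exp(abs(math.log(r))) for r in ratios) / len(ratios) - 1.0
    return avg_ratio, avg_error

# A 2x over-estimate and a 2x under-estimate both contribute a factor of 2:
r, e = accuracy_metrics([(20, 10), (5, 10)])   # r = 1.25, e = 1.0
```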
When the result set is small (1-9 rows), the errors are more significant.
range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.51371646596783 | 2.85624536756285
@> | 3.85482923324034 | 2.91433432363562
&& | 3.14281204906205 | 2.28899260461761
(3 rows)
The following table presents the average estimated row count when the
actual row count is 0. This value is quite high for the && operator, but it
comes from only one test, so it's not really representative.
range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+---------------------+-------------
<@ | 1.1259887005649718 | 1770
@> | 1.0598670878194025 | 88329
&& | 28.0000000000000000 | 1
(3 rows)
Same tables for statistics target = 1000.
range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.17132962269887 | 0.394427785424827
@> | 1.35677772347908 | 0.376171286348914
&& | 1.06762781136499 | 0.0874012522386387
(3 rows)
range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.30836881177966 | 2.64459517657192
@> | 3.47535917820028 | 2.55199556747496
&& | 2.49181718664477 | 1.49181718664477
(3 rows)
range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1650879566982409 | 739
@> | 1.0511811463771843 | 89447
(2 rows)
My conclusion is that the current errors are probably OK for selectivity
estimation. But taking into account that the generated datasets ideally fit
the assumptions of the estimation, there could be room for improvement.
Especially, it's unclear why the estimates for "<@" and "@>" have much
greater error than the estimate for "&&". Possibly, it's caused by some
bugs.
------
With best regards,
Alexander Korotkov.
New revision of patch with two fixes:
1) Check if histogram bin width is zero in get_position.
2) Check statsTuple is valid tuple in rangecontsel.
------
With best regards,
Alexander Korotkov.
On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
My conclusion is so, that current errors are probably ok for selectivity
estimation. But taking into attention that generated datasets ideally fits
assumptions of estimation, there could be room for improvement. Especially,
it's unclear why estimate for "<@" and "@>" have much greater error than
estimate for "&&". Possibly, it's caused by some bugs.
ITSM, I found the reason for the inaccuracy: the implementation of linear
interpolation was wrong. Fixed version is attached. Now I need to rerun the
tests, plus possible refactoring and comments rework.
------
With best regards,
Alexander Korotkov.
Attachments:
range_stat-0.4.patch.gz (application/x-gzip)
On Mon, Aug 13, 2012 at 1:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
My conclusion is so, that current errors are probably ok for selectivity
estimation. But taking into attention that generated datasets ideally fits
assumptions of estimation, there could be room for improvement. Especially,
it's unclear why estimate for "<@" and "@>" have much greater error than
estimate for "&&". Possibly, it's caused by some bugs.
ITSM, I found reason of inaccuracy. Implementation of linear interpolation
was wrong. Fixed version is attached. Now, need to rerun tests, possible
refactoring and comments rework.
After fixing a few more bugs, I've got a version with much more reasonable
accuracy.
Statistics target = 100.
Relatively large result sets (>= 10)
test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00404179116863 | 0.0504415454560903
@> | 1.06364108531688 | 0.105646077989812
&& | 1.00757984721409 | 0.0420984234933233
(3 rows)
Small result sets (1 - 9)
test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.31530838062865 | 0.654886592410495
@> | 2.78708078320147 | 1.94124123003433
&& | 1.93268112525538 | 1.09904919063335
(3 rows)
Empty result sets
test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1437670609645132 | 1099
@> | 1.0479430126460701 | 87458
(2 rows)
Statistics target = 1000.
Relatively large result sets (>= 10)
test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00073999445381 | 0.045099762607524
@> | 1.05296320350853 | 0.0907489633452971
&& | 1.00217602359039 | 0.0353421159150165
(3 rows)
Small result sets (1 - 9)
test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.26946358795998 | 0.577803898836364
@> | 2.69000633430211 | 1.83165424646645
&& | 1.48715184186882 | 0.577998652291105
(3 rows)
Empty result sets
test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.0887096774193548 | 1364
@> | 1.0423876983771183 | 89224
&& | 5.0000000000000000 | 1
(3 rows)
------
With best regards,
Alexander Korotkov.
Attachments:
range_stat-0.6.patch.gz (application/x-gzip)
On 14.08.2012 09:45, Alexander Korotkov wrote:
After fixing few more bugs, I've a version with much more reasonable
accuracy.
Great! One little thing just occurred to me:
You're relying on the regular scalar selectivity estimators for the <<,
>>, &< and &> operators. That seems bogus, in particular for << and &<,
because ineq_histogram_selectivity then performs a binary search of the
histogram using those operators. << and &< compare the *upper* bound of
the value in table against the lower bound of constant, but the
histogram is constructed using regular < operator, which sorts the
entries by lower bound. I think the estimates you now get for those
operators are quite bogus if there is a non-trivial amount of overlap
between ranges. For example:
postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
QUERY PLAN
--------------------------------------------------------------------------------
-----------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14)
(actual time=0.
060..1340.147 rows=200000 loops=1)
Filter: (r << '[200000,200001)'::int4range)
Rows Removed by Filter: 800000
Total runtime: 1371.865 ms
(4 rows)
It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note
that the estimation of overlap selectivity could be implemented using
separate histograms of lower bounds and upper bounds, without requiring
a histogram of range lengths, because a && b == NOT (a << b OR a >> b).
I'm not sure if the estimates we'd get that way would be better or worse
than your current method, but I think it would be easier to understand.
I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.
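The identity a && b == NOT (a << b OR a >> b) can be sketched in Python (hedged: sorted samples of bounds stand in for the two histograms, half-open [) semantics are assumed, and `overlap_sel` is a made-up name, not code from the patch):

```python
import bisect

# Sketch of "a && b == NOT (a << b OR a >> b)" turned into an estimate.
# For half-open ranges: r << c iff upper(r) <= lower(c), and
# r >> c iff lower(r) >= upper(c). Sorted lists of the column's lower
# and upper bounds stand in for the two histograms (illustrative only).

def overlap_sel(lowers, uppers, const_lower, const_upper):
    n = len(lowers)
    p_left = bisect.bisect_right(uppers, const_lower) / n        # P(r << c)
    p_right = (n - bisect.bisect_left(lowers, const_upper)) / n  # P(r >> c)
    return 1.0 - p_left - p_right

# Ranges [i, i+10) for i in 0..99 versus the constant [50, 60):
lowers = list(range(100))
uppers = [i + 10 for i in range(100)]
sel = overlap_sel(lowers, uppers, 50, 60)   # -> 0.19
```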
The code that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify that? My first
thought is that there should be a helper function that gets a range length
as argument, and returns the fraction of tuples with length >= argument.
It would do the lookup in the length histogram to find the right
histogram bin, and do the linear interpolation within the bin. You're
assuming that length is independent of lower/upper bound, so you
shouldn't need any other parameters than range length for that estimation.
You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead of dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops
significantly, but I wasn't sure if there's some more intelligence in
the way you combine values from the length and lower bound histograms
that you couldn't do with such a helper function.
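The proposed helper might look like this hedged Python sketch (assuming a PostgreSQL-style equi-depth length histogram, where each bin holds an equal fraction of the rows; `frac_length_ge` is a hypothetical name, not the patch's C code):

```python
import bisect

# Sketch of the suggested helper: fraction of tuples with length >= x,
# via a lookup in an equi-depth length histogram plus linear
# interpolation within the containing bin. Illustrative only.

def frac_length_ge(length_hist, x):
    """length_hist: sorted bin boundaries; each of the len-1 bins is
    assumed to hold an equal fraction of the rows."""
    nbins = len(length_hist) - 1
    if x <= length_hist[0]:
        return 1.0
    if x >= length_hist[-1]:
        return 0.0
    i = bisect.bisect_right(length_hist, x) - 1      # bin containing x
    lo, hi = length_hist[i], length_hist[i + 1]
    within = (x - lo) / (hi - lo)                    # position inside bin
    return (nbins - i - within) / nbins

hist = [0, 10, 20, 30, 40]          # 4 equal-population bins
half = frac_length_ge(hist, 20)     # -> 0.5
```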
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
On 14.08.2012 09:45, Alexander Korotkov wrote:
After fixing few more bugs, I've a version with much more reasonable
accuracy.
Great! One little thing just occurred to me:
You're relying on the regular scalar selectivity estimators for the <<,
>>, &< and &> operators. That seems bogus, in particular for << and &<,
because ineq_histogram_selectivity then performs a binary search of the
histogram using those operators. << and &< compare the *upper* bound of the
value in table against the lower bound of constant, but the histogram is
constructed using regular < operator, which sorts the entries by lower
bound. I think the estimates you now get for those operators are quite
bogus if there is a non-trivial amount of overlap between ranges. For
example:
postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14) (actual
time=0.
060..1340.147 rows=200000 loops=1)
Filter: (r << '[200000,200001)'::int4range)
Rows Removed by Filter: 800000
Total runtime: 1371.865 ms
(4 rows)
It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note that
the estimation of overlap selectivity could be implemented using separate
histograms of lower bounds and upper bounds, without requiring a histogram
of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if
the estimates we'd get that way would be better or worse than your current
method, but I think it would be easier to understand.
I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.
Oh, actually I didn't touch those operators. Selectivity estimation
functions for them were already in the catalog; they just didn't work
previously because there were no statistics. A histogram of upper bounds
would be both more accurate and more natural for some operators. However,
it requires collecting additional statistics, while AFAICS it doesn't
liberate us from having a histogram of range lengths.
The code in that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify that? My first
thought is that there should be helper function that gets a range length as
argument, and returns the fraction of tuples with length >= argument. It
would do the lookup in the length histogram to find the right histogram
bin, and do the linear interpolation within the bin. You're assuming that
length is independent of lower/upper bound, so you shouldn't need any other
parameters than range length for that estimation.
You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops significantly,
but I wasn't sure if there's some more intelligence in the way you combine
values from the length and lower bound histograms that you couldn't do with
such a helper function.
Yes, I also thought about something like this. But in order to preserve the
current estimate accuracy, it should be more complicated, for the following
reasons:
1) In the last version, I don't estimate just the fraction of tuples with
length >= argument, but the area under the length histogram between two
length bounds (length_hist_summ).
2) If the histogram ends before reaching the given length bound, we also
need to return the place where that happened. Now it is performed by
hist_frac *= (length - prev_dist) / (dist - prev_dist).
I'm going to try some simplification, taking care of both mentioned
aspects.
------
With best regards,
Alexander Korotkov.
On 15.08.2012 10:38, Alexander Korotkov wrote:
On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:
It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note that
the estimation of overlap selectivity could be implemented using separate
histograms of lower bounds and upper bounds, without requiring a histogram
of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if
the estimates we'd get that way would be better or worse than your current
method, but I think it would be easier to understand.
I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.
Oh, actually I didn't touch those operators. Selectivity estimation
functions for them were already in the catalog, they didn't work previously
just because no statistics.
Yeah, without the histogram, the scalar selectivity estimator sort-of
works, in that it returns the estimate just based on the most common
values and a constant.
Histogram of upper bounds would be both more
accurate and natural for some operators. However, it requires collecting
additional statistics while AFAICS it doesn't liberate us from having
histogram of range lengths.
Hmm, if we collected a histogram of lower bounds and a histogram of
upper bounds, that would be roughly the same amount of data as for the
"standard" histogram with both bounds in the same histogram.
The code in that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify that? My first
thought is that there should be helper function that gets a range length as
argument, and returns the fraction of tuples with length >= argument. It
would do the lookup in the length histogram to find the right histogram
bin, and do the linear interpolation within the bin. You're assuming that
length is independent of lower/upper bound, so you shouldn't need any other
parameters than range length for that estimation.
You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops significantly,
but I wasn't sure if there's some more intelligence in the way you combine
values from the length and lower bound histograms that you couldn't do with
such a helper function.
Yes, I also thought about something like this. But, in order to save
current estimate accuracy, it should be more complicated in following
reasons:
1) In last version, I don't estimate just fraction of tuples with length >=
argument, but area under length histogram between two length bounds
(length_hist_summ).
2) In histogram ends up before reaching given length bound we also need to
return place where it happened. Now it is performed by hist_frac *= (length
- prev_dist) / (dist - prev_dist).
I'm going to try some simplification with taking care about both mentioned
aspects.
Thanks.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
Histogram of upper bounds would be both more
accurate and natural for some operators. However, it requires collecting
additional statistics while AFAICS it doesn't liberate us from having
histogram of range lengths.
Hmm, if we collected a histogram of lower bounds and a histogram of upper
bounds, that would be roughly the same amount of data as for the "standard"
histogram with both bounds in the same histogram.
Ok, we have to decide whether we need the "standard" histogram. In some
cases it can be used for more accurate estimation of the < and > operators.
But I think it is not so important. So, can we replace the "standard"
histogram with histograms of lower and upper bounds?
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:
Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard" histogram
with histograms of lower and upper bounds?
You should assign a new pg_statistic "kind" value (see pg_statistic.h)
rather than mislabel this as being a standard histogram. However,
there's nothing wrong with a data-type-specific stats collection
function choosing to gather only this type of histogram and not the
standard one.
regards, tom lane
On 15.08.2012 11:34, Alexander Korotkov wrote:
On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:
Histogram of upper bounds would be both more
accurate and natural for some operators. However, it requires collecting
additional statistics while AFAICS it doesn't liberate us from having
histogram of range lengths.Hmm, if we collected a histogram of lower bounds and a histogram of upper
bounds, that would be roughly the same amount of data as for the "standard"
histogram with both bounds in the same histogram.
Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard" histogram
with histograms of lower and upper bounds?
Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
On 15.08.2012 11:34, Alexander Korotkov wrote:
Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard"
histogram
with histograms of lower and upper bounds?
Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.
New version of patch.
* Collect a new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is the
lower and upper bound histograms combined into a single array of ranges,
collected instead of STATISTIC_KIND_HISTOGRAM.
* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.
------
With best regards,
Alexander Korotkov.
On 20.08.2012 00:31, Alexander Korotkov wrote:
On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:
On 15.08.2012 11:34, Alexander Korotkov wrote:
Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard"
histogram
with histograms of lower and upper bounds?
Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.
New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.
Ah, that's an interesting approach. So essentially, the histogram looks
just like a normal STATISTIC_KIND_HISTOGRAM histogram, but the values
stored in it are not picked the usual way. The usual way would be to
pick N evenly-spaced values from the column, and store those. Instead,
you pick N evenly-spaced lower bounds, and N evenly-spaced upper bounds,
and construct N range values from those. Looking at a single value in
the histogram, its lower bound comes from a different row than its upper
bound.
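The construction described above can be sketched in a few lines of Python (hedged: `bounds_histogram` is an illustrative name, "evenly spaced" is simplified to evenly spaced indexes into the sorted samples, and this is not the analyze code from the patch):

```python
# Sketch of the bounds histogram construction described above: sort the
# lower and upper bounds separately, pick N evenly spaced values from
# each, and zip them into N "ranges" -- so a single histogram entry may
# combine the lower bound of one row with the upper bound of another.
# Illustrative only, not the patch's typanalyze code.

def bounds_histogram(ranges, nentries):
    """ranges: (lower, upper) tuples; returns nentries (lower, upper) pairs."""
    lowers = sorted(lo for lo, _ in ranges)
    uppers = sorted(up for _, up in ranges)
    n = len(ranges)
    step = (n - 1) / (nentries - 1)
    picks = [round(i * step) for i in range(nentries)]
    return [(lowers[p], uppers[p]) for p in picks]

sample = [(0, 5), (2, 3), (4, 9), (6, 7), (8, 13)]
hist = bounds_histogram(sample, 3)   # -> [(0, 3), (4, 7), (8, 13)]
```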
That's pretty clever - the histogram has a shape and order that's
compatible with a histogram you'd get with the standard scalar
typanalyze function. In fact, I think you could just let the standard
scalar estimators for < and > use that histogram as is. Perhaps we
should use STATISTIC_KIND_HISTOGRAM for this after all...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 20.08.2012 00:31, Alexander Korotkov wrote:
On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:
On 15.08.2012 11:34, Alexander Korotkov wrote:
Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard"
histogram
with histograms of lower and upper bounds?
Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.
New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.
One worry I have about that format for the histogram is that you
deserialize all the values in the histogram, before you do the binary
searches. That seems expensive if stats target is very high. I guess you
could deserialize them lazily to alleviate that, though.
* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.
Thanks!
I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.
I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.
Please take a look, to see if I messed up something.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com