Statistics and selectivity estimation for ranges

Started by Alexander Korotkov over 13 years ago; 41 messages
#1 Alexander Korotkov
aekorotkov@gmail.com
1 attachment(s)

Hackers,

The attached patch is for collecting statistics and doing selectivity estimation for ranges.

In order to make the estimates accurate for every distribution of ranges, we would have to collect the 2d-distribution of lower and upper bounds of ranges into some kind of 2d-histogram. However, this patch uses a simplification and assumes the distribution of lower bounds and the distribution of lengths to be independent. The distribution of lower bounds is available from the standard scalar statistics, and this patch additionally collects statistics on range length. The patch includes selectivity estimation for the "&&", "@>" and "<@" operators on ranges. Linear interpolation is used to get more accurate results.
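
To make that concrete, here is a rough standalone sketch (plain C, not the patch's actual code; all names are just for illustration) of how an overlap estimate can combine an equi-depth lower-bound histogram with an independent length distribution, representing each lower-bound bin by its midpoint:

#include <stdio.h>

/* fraction of sampled lengths strictly greater than x */
static double
frac_len_greater(const double *lens, int nlens, double x)
{
	int		i, cnt = 0;

	for (i = 0; i < nlens; i++)
		if (lens[i] > x)
			cnt++;
	return (double) cnt / nlens;
}

/*
 * Estimate P(r && [q_lo, q_hi)).  lhist holds the boundaries of an
 * equi-depth histogram of lower bounds, lens a sample of range lengths;
 * the two are assumed independent.
 */
static double
overlap_sel(const double *lhist, int nlhist,
			const double *lens, int nlens,
			double q_lo, double q_hi)
{
	double	sel = 0.0;
	double	binweight = 1.0 / (nlhist - 1);
	int		i;

	for (i = 0; i < nlhist - 1; i++)
	{
		double	mid = (lhist[i] + lhist[i + 1]) / 2.0;

		if (mid >= q_hi)
			break;				/* ranges in this bin start after the query */
		if (mid >= q_lo)
			sel += binweight;	/* any length overlaps the query */
		else					/* must be long enough to reach q_lo */
			sel += binweight * frac_len_greater(lens, nlens, q_lo - mid);
	}
	return sel;
}

int
main(void)
{
	double	lhist[] = {0, 10, 20, 30, 40, 50};
	double	lens[] = {1, 2, 3, 5, 8, 13, 21, 34};

	printf("estimated && selectivity: %g\n",
		   overlap_sel(lhist, 6, lens, 8, 25.0, 35.0));
	return 0;
}

The patch itself uses linear interpolation within the bins rather than this crude midpoint shortcut, which is what gives the more accurate results mentioned above.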

Here are some examples with a test dataset where the lower bound of the range follows a Gaussian distribution and the length of the range follows an exponential distribution.

test=# CREATE TABLE range_test as (SELECT int4range(lb, lb + len) AS r FROM
(SELECT (sqrt(-2*ln(random())) * sin(2*pi()*random()) * 1000000)::int as
lb, (-10000*ln(1.0 - random()) + 1)::int as len FROM
generate_series(1,1000000)) x);
SELECT 1000000

test=# ANALYZE range_test;
ANALYZE

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(700000,710000);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=7535 width=14) (actual
time=0.119..403.494 rows=6138 loops=1)
Filter: (r && '[700000,710000)'::int4range)
Rows Removed by Filter: 993862
Total runtime: 403.945 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(200000,300000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*42427* width=14)
(actual time=0.100..401.079 rows=*42725* loops=1)
Filter: (r && '[200000,300000)'::int4range)
Rows Removed by Filter: 957275
Total runtime: 403.055 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(100000,150000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*15341* width=14)
(actual time=0.129..382.114 rows=*16014* loops=1)
Filter: (r <@ '[100000,150000)'::int4range)
Rows Removed by Filter: 983986
Total runtime: 382.985 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(600000,603000);
QUERY PLAN

---------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*122* width=14) (actual
time=1.527..383.511 rows=*127* loops=1)
Filter: (r <@ '[600000,603000)'::int4range)
Rows Removed by Filter: 999873
Total runtime: 383.586 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(100000,100400);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*5166* width=14) (actual
time=0.238..377.712 rows=*3909* loops=1)
Filter: (r @> '[100000,100400)'::int4range)
Rows Removed by Filter: 996091
Total runtime: 378.018 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(500000,530000);
QUERY PLAN

----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*342* width=14) (actual
time=11.796..382.986 rows=*171* loops=1)
Filter: (r @> '[500000,530000)'::int4range)
Rows Removed by Filter: 999829
Total runtime: 383.066 ms
(4 rows)

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.2.patch.gz (application/x-gzip)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: Statistics and selectivity estimation for ranges

On 04.08.2012 12:31, Alexander Korotkov wrote:

Hackers,

The attached patch is for collecting statistics and doing selectivity estimation for ranges.

In order to make the estimates accurate for every distribution of ranges, we would have to collect the 2d-distribution of lower and upper bounds of ranges into some kind of 2d-histogram. However, this patch uses a simplification and assumes the distribution of lower bounds and the distribution of lengths to be independent.

Sounds reasonable. Another possibility would be to calculate the average length for each lower-bound bin. So you would e.g. know the average length of values with lower bound between 1-10, and the average length of values with lower bound between 10-20, and so forth. Within a bin, you would have to assume that the distribution of the lengths is fixed.

PS. get_position() should guard against division by zero, when subdiff
returns zero.
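
For what it's worth, a guarded interpolation could look roughly like this (a standalone sketch, not the patch's actual get_position; the fallback value is just one reasonable choice):

#include <stdio.h>

/*
 * Relative position of a value within a histogram bin, used for linear
 * interpolation.  Falls back to the middle of the bin when the bin has
 * zero width, instead of dividing by zero.
 */
static double
bin_position(double value, double bin_lower, double bin_upper)
{
	double	width = bin_upper - bin_lower;		/* what subdiff computes */

	if (width <= 0.0)
		return 0.5;				/* degenerate bin: assume the middle */
	if (value <= bin_lower)
		return 0.0;
	if (value >= bin_upper)
		return 1.0;
	return (value - bin_lower) / width;
}

int
main(void)
{
	printf("%g\n", bin_position(700000.0, 650000.0, 750000.0));	/* 0.5 */
	printf("%g\n", bin_position(42.0, 42.0, 42.0));			/* guarded */
	return 0;
}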

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3 Matthias
nitrogenycs@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Statistics and selectivity estimation for ranges

Having statistics on ranges was really missing! The planner was making some really, really bad choices on bigger tables regarding seq/random scans, nested loop vs. other joins, etc.

Is there any chance this makes it into 9.2 final? It would really round off the introduction of range types and maybe avoid problems like "the new range types are slow" (just due to the bad row estimates).

Thanks for implementing this feature,
-Matthias

#4 Matthias
nitrogenycs@gmail.com
In reply to: Matthias (#3)
Fwd: Statistics and selectivity estimation for ranges

Having statistics on ranges was really missing! The planner was making some really, really bad choices on bigger tables regarding seq/random scans, nested loop vs. other joins, etc.

Is there any chance this makes it into 9.2 final? It would really round off the introduction of range types and maybe avoid problems like "the new range types are slow" (just due to the bad row estimates).

Thanks for implementing this feature,
-Matthias

#5 Josh Berkus
josh@agliodbs.com
In reply to: Matthias (#4)
Re: Fwd: Statistics and selectivity estimation for ranges

Is there any chance this makes it into 9.2 final? It would really round off the introduction of range types and maybe avoid problems like "the new range types are slow" (just due to the bad row estimates).

Nope, that's strictly a 9.3 feature. 9.2 is in beta2.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#6 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Statistics and selectivity estimation for ranges

On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 04.08.2012 12:31, Alexander Korotkov wrote:

Hackers,

The attached patch is for collecting statistics and doing selectivity estimation for ranges.

In order to make the estimates accurate for every distribution of ranges, we would have to collect the 2d-distribution of lower and upper bounds of ranges into some kind of 2d-histogram. However, this patch uses a simplification and assumes the distribution of lower bounds and the distribution of lengths to be independent.

Sounds reasonable. Another possibility would be to calculate the average length for each lower-bound bin. So you would e.g. know the average length of values with lower bound between 1-10, and the average length of values with lower bound between 10-20, and so forth. Within a bin, you would have to assume that the distribution of the lengths is fixed.

Interesting idea. AFAICS, if we store the average length for each lower-bound bin, we still have to assume some kind of distribution of range length in order to do the estimates. For example, assume that range length has an exponential distribution. So the trade-off is: we don't have to assume that the lower-bound distribution is independent of the length distribution, but we do have to assume some particular kind of length distribution. Actually, I don't know which is better.
Ideally, we would have a range-length histogram for each lower-bound bin, or an upper-bound histogram for each lower-bound bin. But storing that amount of data seems too expensive.

------
With best regards,
Alexander Korotkov.

#7 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#6)
2 attachment(s)
Re: Statistics and selectivity estimation for ranges

For testing statistics accuracy I've used the same datasets as for testing opclasses performance:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00414.php
The test script and database schema are attached.
A dump with the test results can be downloaded here:
http://test.i-gene.ru/uploads/range_stat_tests.sql.gz

The following table shows the accuracy statistics when the actual row count is reasonably large (>= 10). The second column shows the average ratio of the estimated row count to the actual row count. The third column shows the average relative error of the estimate (avg(exp(abs(ln(ratio)))) - 1, so a factor-of-two over- or underestimate counts as an error of 1.0).

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.27166784340153 | 0.498570654434906
@> | 1.35965412121763 | 0.384991198200582
&& | 1.08236985243139 | 0.105298599354035
(3 rows)

When the result set is small (1-9 rows), the errors are more significant.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.51371646596783 | 2.85624536756285
@> | 3.85482923324034 | 2.91433432363562
&& | 3.14281204906205 | 2.28899260461761
(3 rows)

The following table presents the average estimated row count when the actual row count is 0. This value is quite high for the && operator, but it comes from only one test, so it's not really representative.

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+---------------------+-------------
<@ | 1.1259887005649718 | 1770
@> | 1.0598670878194025 | 88329
&& | 28.0000000000000000 | 1
(3 rows)

Same tables for statistics target = 1000.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.17132962269887 | 0.394427785424827
@> | 1.35677772347908 | 0.376171286348914
&& | 1.06762781136499 | 0.0874012522386387
(3 rows)

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.30836881177966 | 2.64459517657192
@> | 3.47535917820028 | 2.55199556747496
&& | 2.49181718664477 | 1.49181718664477
(3 rows)

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1650879566982409 | 739
@> | 1.0511811463771843 | 89447
(2 rows)

My conclusion is that the current errors are probably OK for selectivity estimation. But taking into account that the generated datasets ideally fit the assumptions of the estimation, there could be room for improvement. In particular, it's unclear why the estimates for "<@" and "@>" have much greater error than the estimate for "&&". Possibly it's caused by some bugs.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat.php (application/x-httpd-php)
range_stat_schema.sql (application/octet-stream)
#8 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#7)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

New revision of the patch with two fixes:
1) Check whether the histogram bin width is zero in get_position.
2) Check that statsTuple is a valid tuple in rangecontsel.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.3.patch.gz (application/x-gzip)
#9 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#7)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

My conclusion is that the current errors are probably OK for selectivity estimation. But taking into account that the generated datasets ideally fit the assumptions of the estimation, there could be room for improvement. In particular, it's unclear why the estimates for "<@" and "@>" have much greater error than the estimate for "&&". Possibly it's caused by some bugs.

ISTM I found the reason for the inaccuracy: the implementation of linear interpolation was wrong. A fixed version is attached. Now I need to rerun the tests, do possible refactoring, and rework the comments.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.4.patch.gz (application/x-gzip)
#10 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#9)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Mon, Aug 13, 2012 at 1:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

My conclusion is that the current errors are probably OK for selectivity estimation. But taking into account that the generated datasets ideally fit the assumptions of the estimation, there could be room for improvement. In particular, it's unclear why the estimates for "<@" and "@>" have much greater error than the estimate for "&&". Possibly it's caused by some bugs.

ISTM I found the reason for the inaccuracy: the implementation of linear interpolation was wrong. A fixed version is attached. Now I need to rerun the tests, do possible refactoring, and rework the comments.

After fixing a few more bugs, I have a version with much more reasonable accuracy.

Statistics target = 100.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00404179116863 | 0.0504415454560903
@> | 1.06364108531688 | 0.105646077989812
&& | 1.00757984721409 | 0.0420984234933233
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 100 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.31530838062865 | 0.654886592410495
@> | 2.78708078320147 | 1.94124123003433
&& | 1.93268112525538 | 1.09904919063335
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1437670609645132 | 1099
@> | 1.0479430126460701 | 87458
(2 rows)

Statistics target = 1000.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00073999445381 | 0.045099762607524
@> | 1.05296320350853 | 0.0907489633452971
&& | 1.00217602359039 | 0.0353421159150165
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.id where d.stat_target = 1000 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.26946358795998 | 0.577803898836364
@> | 2.69000633430211 | 1.83165424646645
&& | 1.48715184186882 | 0.577998652291105
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.0887096774193548 | 1364
@> | 1.0423876983771183 | 89224
&& | 5.0000000000000000 | 1
(3 rows)

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.6.patch.gz (application/x-gzip)
#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#10)
Re: Statistics and selectivity estimation for ranges

On 14.08.2012 09:45, Alexander Korotkov wrote:

After fixing a few more bugs, I have a version with much more reasonable accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<, >>, &< and &> operators. That seems bogus, in particular for << and &<, because ineq_histogram_selectivity then performs a binary search of the histogram using those operators. << and &< compare the *upper* bound of the value in the table against the lower bound of the constant, but the histogram is constructed using the regular < operator, which sorts the entries by lower bound. I think the estimates you now get for those operators are quite bogus if there is a non-trivial amount of overlap between ranges. For example:

postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
Filter: (r << '[200000,200001)'::int4range)
Rows Removed by Filter: 800000
Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note
that the estimation of overlap selectivity could be implemented using
separate histograms of lower bounds and upper bounds, without requiring
a histogram of range lengths, because a && b == NOT (a << b OR a >> b).
I'm not sure if the estimates we'd get that way would be better or worse
than your current method, but I think it would be easier to understand.
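
To illustrate the complement idea, here is a rough standalone sketch (not actual patch code, and it ignores bound inclusivity, empty ranges and MCVs; the sorted arrays simply stand in for the lower- and upper-bound histograms):

#include <stdio.h>

/* fraction of sorted values strictly less than x (binary search) */
static double
frac_less(const double *sorted, int n, double x)
{
	int		lo = 0, hi = n;

	while (lo < hi)
	{
		int		mid = (lo + hi) / 2;

		if (sorted[mid] < x)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (double) lo / n;
}

int
main(void)
{
	/* sorted upper and lower bounds of the sampled ranges [i, 2i) */
	double	uppers[] = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20};
	double	lowers[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
	int		n = 10;
	double	q_lo = 5, q_hi = 9;		/* the query constant [5, 9) */

	double	sel_left = frac_less(uppers, n, q_lo);			/* a << q */
	double	sel_right = 1.0 - frac_less(lowers, n, q_hi);	/* a >> q */

	/* a && q  ==  NOT (a << q OR a >> q) */
	printf("&& selectivity estimate: %g\n", 1.0 - sel_left - sel_right);
	return 0;
}

Here the estimate comes out as 0.6, which matches the actual fraction of the ten sample ranges that overlap [5, 9).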

I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.

The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify that? My first thought is that there should be a helper function that gets a range length as an argument, and returns the fraction of tuples with length >= argument.
It would do the lookup in the length histogram to find the right
histogram bin, and do the linear interpolation within the bin. You're
assuming that length is independent of lower/upper bound, so you
shouldn't need any other parameters than range length for that estimation.

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead of dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops
significantly, but I wasn't sure if there's some more intelligence in
the way you combine values from the length and lower bound histograms
that you couldn't do with such a helper function.
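
Something along these lines, perhaps (a standalone sketch, not the patch's code; it assumes an equi-depth length histogram whose boundaries are sorted and where each bin holds an equal share of the sampled tuples):

#include <stdio.h>

/*
 * Fraction of tuples whose range length is >= the given length, estimated
 * from an equi-depth length histogram: count the bins entirely to the
 * right of the length, plus a linearly interpolated share of the bin the
 * length falls into (guarding against zero-width bins).
 */
static double
frac_length_ge(const double *hist, int nhist, double length)
{
	int		lo = 0, hi = nhist - 1;
	double	binfrac;

	if (length <= hist[0])
		return 1.0;
	if (length >= hist[nhist - 1])
		return 0.0;

	/* binary search for the bin with hist[lo] <= length < hist[lo + 1] */
	while (lo + 1 < hi)
	{
		int		mid = (lo + hi) / 2;

		if (hist[mid] <= length)
			lo = mid;
		else
			hi = mid;
	}

	if (hist[lo + 1] > hist[lo])
		binfrac = (length - hist[lo]) / (hist[lo + 1] - hist[lo]);
	else
		binfrac = 0.5;

	return ((nhist - 2 - lo) + (1.0 - binfrac)) / (nhist - 1);
}

int
main(void)
{
	/* boundaries of a 4-bin equi-depth histogram of range lengths */
	double	hist[] = {0, 10, 25, 60, 200};

	printf("%g\n", frac_length_ge(hist, 5, 25.0));		/* 0.5 */
	printf("%g\n", frac_length_ge(hist, 5, 130.0));		/* 0.125 */
	return 0;
}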

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: Statistics and selectivity estimation for ranges

On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 14.08.2012 09:45, Alexander Korotkov wrote:

After fixing a few more bugs, I have a version with much more reasonable accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<, >>, &< and &> operators. That seems bogus, in particular for << and &<, because ineq_histogram_selectivity then performs a binary search of the histogram using those operators. << and &< compare the *upper* bound of the value in the table against the lower bound of the constant, but the histogram is constructed using the regular < operator, which sorts the entries by lower bound. I think the estimates you now get for those operators are quite bogus if there is a non-trivial amount of overlap between ranges. For example:

postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
Filter: (r << '[200000,200001)'::int4range)
Rows Removed by Filter: 800000
Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note that
the estimation of overlap selectivity could be implemented using separate
histograms of lower bounds and upper bounds, without requiring a histogram
of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if
the estimates we'd get that way would be better or worse than your current
method, but I think it would be easier to understand.

I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.

Oh, actually I didn't touch those operators. Selectivity estimation functions for them were already in the catalog; they just didn't work previously because there were no statistics. A histogram of upper bounds would be both more accurate and more natural for some operators. However, it requires collecting additional statistics, while AFAICS it doesn't free us from having a histogram of range lengths.

The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify that? My first thought is that there should be a helper function that gets a range length as an argument, and returns the fraction of tuples with length >= argument. It
would do the lookup in the length histogram to find the right histogram
bin, and do the linear interpolation within the bin. You're assuming that
length is independent of lower/upper bound, so you shouldn't need any other
parameters than range length for that estimation.

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead of dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops significantly,
but I wasn't sure if there's some more intelligence in the way you combine
values from the length and lower bound histograms that you couldn't do with
such a helper function.

Yes, I also thought about something like this. But in order to preserve the current estimate accuracy, it has to be more complicated, for the following reasons:
1) In the last version, I don't estimate just the fraction of tuples with length >= argument, but the area under the length histogram between two length bounds (length_hist_summ).
2) If the histogram ends before reaching the given length bound, we also need to return the place where that happened. Currently this is done by hist_frac *= (length - prev_dist) / (dist - prev_dist).
I'm going to try some simplification while taking care of both of these aspects.

------
With best regards,
Alexander Korotkov.

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#12)
Re: Statistics and selectivity estimation for ranges

On 15.08.2012 10:38, Alexander Korotkov wrote:

On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note that
the estimation of overlap selectivity could be implemented using separate
histograms of lower bounds and upper bounds, without requiring a histogram
of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if
the estimates we'd get that way would be better or worse than your current
method, but I think it would be easier to understand.

I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.

Oh, actually I didn't touch those operators. Selectivity estimation functions for them were already in the catalog; they just didn't work previously because there were no statistics.

Yeah, without the histogram, the scalar selectivity estimator sort-of
works, in that it returns the estimate just based on the most common
values and a constant.

A histogram of upper bounds would be both more accurate and more natural for some operators. However, it requires collecting additional statistics, while AFAICS it doesn't free us from having a histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram of
upper bounds, that would be roughly the same amount of data as for the
"standard" histogram with both bounds in the same histogram.

The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify that? My first thought is that there should be a helper function that gets a range length as an argument, and returns the fraction of tuples with length >= argument. It
would do the lookup in the length histogram to find the right histogram
bin, and do the linear interpolation within the bin. You're assuming that
length is independent of lower/upper bound, so you shouldn't need any other
parameters than range length for that estimation.

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead of dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops significantly,
but I wasn't sure if there's some more intelligence in the way you combine
values from the length and lower bound histograms that you couldn't do with
such a helper function.

Yes, I also thought about something like this. But in order to preserve the current estimate accuracy, it has to be more complicated, for the following reasons:
1) In the last version, I don't estimate just the fraction of tuples with length >= argument, but the area under the length histogram between two length bounds (length_hist_summ).
2) If the histogram ends before reaching the given length bound, we also need to return the place where that happened. Currently this is done by hist_frac *= (length - prev_dist) / (dist - prev_dist).
I'm going to try some simplification while taking care of both of these aspects.

Thanks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: Statistics and selectivity estimation for ranges

On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

A histogram of upper bounds would be both more accurate and more natural for some operators. However, it requires collecting additional statistics, while AFAICS it doesn't free us from having a histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram of upper
bounds, that would be roughly the same amount of data as for the "standard"
histogram with both bounds in the same histogram.

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

------
With best regards,
Alexander Korotkov.

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#14)
Re: Statistics and selectivity estimation for ranges

Alexander Korotkov <aekorotkov@gmail.com> writes:

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

You should assign a new pg_statistic "kind" value (see pg_statistic.h)
rather than mislabel this as being a standard histogram. However,
there's nothing wrong with a data-type-specific stats collection
function choosing to gather only this type of histogram and not the
standard one.

regards, tom lane

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#14)
Re: Statistics and selectivity estimation for ranges

On 15.08.2012 11:34, Alexander Korotkov wrote:

On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

A histogram of upper bounds would be both more accurate and more natural for some operators. However, it requires collecting additional statistics, while AFAICS it doesn't free us from having a histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram of upper
bounds, that would be roughly the same amount of data as for the "standard"
histogram with both bounds in the same histogram.

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#16)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 15.08.2012 11:34, Alexander Korotkov wrote:

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.

New version of the patch:
* Collect a new stakind, STATISTIC_KIND_BOUNDS_HISTOGRAM, which is the lower and upper bound histograms combined into a single array of ranges, instead of STATISTIC_KIND_HISTOGRAM.
* Selectivity estimation for >, >=, <, <=, <<, >>, &< and &> using this histogram.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.7.patch.gz (application/x-gzip)
#19 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#18)
Re: Statistics and selectivity estimation for ranges

On 20.08.2012 00:31, Alexander Korotkov wrote:

On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

On 15.08.2012 11:34, Alexander Korotkov wrote:

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.

New version of the patch:
* Collect a new stakind, STATISTIC_KIND_BOUNDS_HISTOGRAM, which is the lower and upper bound histograms combined into a single array of ranges, instead of STATISTIC_KIND_HISTOGRAM.

Ah, that's an interesting approach. So essentially, the histogram looks
just like a normal STATISTIC_KIND_HISTOGRAM histogram, but the values
stored in it are not picked the usual way. The usual way would be to
pick N evenly-spaced values from the column, and store those. Instead,
you pick N evenly-spaced lower bounds, and N evenly-spaced upper bounds,
and construct N range values from those. Looking at a single value in
the histogram, its lower bound comes from a different row than its upper
bound.

That's pretty clever - the histogram has a shape and order that's
compatible with a histogram you'd get with the standard scalar
typanalyze function. In fact, I think you could just let the standard
scalar estimators for < and > to use that histogram as is. Perhaps we
should use STATISTIC_KIND_HISTOGRAM for this after all...
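
For illustration, the construction could look roughly like this (a standalone sketch, not the patch code). Note that because each sampled range has lower <= upper, sorting the two bound arrays independently keeps every constructed entry a valid range:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
	double	x = *(const double *) a;
	double	y = *(const double *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	/* lower and upper bounds of the sampled ranges, in row order */
	double	lowers[] = {5, 1, 9, 3, 7, 2, 8, 4, 6, 0};
	double	uppers[] = {6, 4, 19, 5, 8, 3, 16, 7, 12, 1};
	int		nsamples = 10, nhist = 5;
	int		i;

	/* sort the two bound arrays independently */
	qsort(lowers, nsamples, sizeof(double), cmp_double);
	qsort(uppers, nsamples, sizeof(double), cmp_double);

	/* pick evenly spaced entries from each and pair them up */
	for (i = 0; i < nhist; i++)
	{
		int		pos = i * (nsamples - 1) / (nhist - 1);

		printf("histogram entry %d: [%g, %g)\n", i, lowers[pos], uppers[pos]);
	}
	return 0;
}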

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#18)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On 20.08.2012 00:31, Alexander Korotkov wrote:

On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

On 15.08.2012 11:34, Alexander Korotkov wrote:

OK, we have to decide whether we need the "standard" histogram. In some cases it can be used for more accurate estimation of the < and > operators. But I think that is not so important. So, can we replace the "standard" histogram with histograms of lower and upper bounds?

Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.

New version of the patch:
* Collect a new stakind, STATISTIC_KIND_BOUNDS_HISTOGRAM, which is the lower and upper bound histograms combined into a single array of ranges, instead of STATISTIC_KIND_HISTOGRAM.

One worry I have about that format for the histogram is that you deserialize all the values in the histogram before you do the binary searches. That seems expensive if the stats target is very high. I guess you could deserialize them lazily to alleviate that, though.

* Selectivity estimation for >, >=, <, <=, <<, >>, &< and &> using this histogram.

Thanks!

I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.
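
(For the record, the && estimate in the attached version boils down to
A && B <=> NOT (A << B OR A >> B), i.e. roughly
selectivity(var && const) = 1 - P(upper(var) < lower(const))
                              - P(lower(var) > upper(const)),
with the two probabilities read from the upper-bound and lower-bound
histograms respectively.)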

I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.

Please take a look, to see if I messed up something.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

range_stat-0.7-trimmed-and-edited.patchtext/x-diff; name=range_stat-0.7-trimmed-and-edited.patchDownload
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index a692086..a929f4a 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -30,7 +30,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
 	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
 	tsvector.o tsvector_op.o tsvector_parser.o \
-	txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o
+	txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
+	rangetypes_typanalyze.o rangetypes_selfuncs.o
 
 like.o: like.c like_match.c
 
diff --git a/src/backend/utils/adt/rangetypes.c b/src/backend/utils/adt/rangetypes.c
index fe9e0c4..f229a9d 100644
--- a/src/backend/utils/adt/rangetypes.c
+++ b/src/backend/utils/adt/rangetypes.c
@@ -1228,23 +1228,6 @@ hash_range(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(result);
 }
 
-/* ANALYZE support */
-
-/* typanalyze function for range datatypes */
-Datum
-range_typanalyze(PG_FUNCTION_ARGS)
-{
-	/*
-	 * For the moment, just punt and don't analyze range columns.  If we get
-	 * close to release without having a better answer, we could consider
-	 * letting std_typanalyze do what it can ... but those stats are probably
-	 * next door to useless for most activity with range columns, so it's not
-	 * clear it's worth gathering them.
-	 */
-	PG_RETURN_BOOL(false);
-}
-
-
 /*
  *----------------------------------------------------------
  * CANONICAL FUNCTIONS
diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
new file mode 100644
index 0000000..ebfc427
--- /dev/null
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -0,0 +1,560 @@
+/*-------------------------------------------------------------------------
+ *
+ * rangetypes_selfuncs.c
+ *	  Functions for selectivity estimation of range operators
+ *
+ * Estimates are based on histograms of lower and upper bounds.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/rangetypes_selfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
+#include "utils/lsyscache.h"
+#include "utils/rangetypes.h"
+#include "utils/selfuncs.h"
+#include "utils/typcache.h"
+
+static double calc_hist_selectivity(TypeCacheEntry *typcache,
+					VariableStatData *vardata, RangeType *constval, Oid operator);
+static double calc_hist_selectivity_scalar(TypeCacheEntry *typcache,
+							 RangeBound *constbound,
+							 RangeBound *hist, int hist_nvalues,
+							 bool equal);
+
+static int range_bsearch(TypeCacheEntry *typcache, RangeBound *value,
+			  RangeBound *hist, int hist_length, bool equal);
+static double calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata,
+			  RangeType *constval, Oid operator);
+static float8 get_position(TypeCacheEntry *typcache, RangeBound *value,
+			 RangeBound *hist1, RangeBound *hist2);
+
+/*
+ * Returns a default selectivity estimate for given operator, when we don't
+ * have statistics or cannot use them for some reason
+ */
+static double
+default_range_selectivity(Oid operator)
+{
+	switch (operator)
+	{
+		case OID_RANGE_OVERLAP_OP:
+			return 0.01;
+
+		case OID_RANGE_CONTAINS_OP:
+		case OID_RANGE_CONTAINED_OP:
+			return 0.005;
+
+		case OID_RANGE_CONTAINS_ELEM_OP:
+			/*
+			 * "range @> elem" is more or less identical to a scalar
+			 * inequality "A > b AND A < c".
+			 */
+			return DEFAULT_RANGE_INEQ_SEL;
+
+		case OID_RANGE_LESS_OP:
+		case OID_RANGE_LESS_EQUAL_OP:
+		case OID_RANGE_GREATER_OP:
+		case OID_RANGE_GREATER_EQUAL_OP:
+		case OID_RANGE_LEFT_OP:
+		case OID_RANGE_RIGHT_OP:
+			/* These are similar to regular scalar inequalities */
+			return DEFAULT_INEQ_SEL;
+
+		default:
+			/* all range operators should be handled above, but just in case */
+			return 0.01;
+	}
+}
+
+/*
+ * rangesel -- restriction selectivity for range operators
+ */
+Datum
+rangesel(PG_FUNCTION_ARGS)
+{
+	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+	Oid			operator = PG_GETARG_OID(1);
+	List	   *args = (List *) PG_GETARG_POINTER(2);
+	int			varRelid = PG_GETARG_INT32(3);
+	VariableStatData vardata;
+	Node	   *other;
+	bool		varonleft;
+	Selectivity selec;
+	TypeCacheEntry *typcache;
+	RangeType  *constval = NULL;
+
+	/*
+	 * If expression is not (variable op something) or (something op
+	 * variable), then punt and return a default estimate.
+	 */
+	if (!get_restriction_variable(root, args, varRelid,
+								  &vardata, &other, &varonleft))
+		PG_RETURN_FLOAT8(default_range_selectivity(operator));
+
+	/*
+	 * Can't do anything useful if the something is not a constant, either.
+	 */
+	if (!IsA(other, Const))
+	{
+		ReleaseVariableStats(vardata);
+		PG_RETURN_FLOAT8(default_range_selectivity(operator));
+	}
+
+	/*
+	 * All the range operators are strict, so we can cope with a NULL constant
+	 * right away.
+	 */
+	if (((Const *) other)->constisnull)
+	{
+		ReleaseVariableStats(vardata);
+		PG_RETURN_FLOAT8(0.0);
+	}
+
+	/*
+	 * If var is on the right, commute the operator, so that we can assume the
+	 * var is on the left in what follows.
+	 */
+	if (!varonleft)
+	{
+		/* we have other Op var, commute to make var Op other */
+		operator = get_commutator(operator);
+		if (!operator)
+		{
+			/* Use default selectivity (should we raise an error instead?) */
+			ReleaseVariableStats(vardata);
+			PG_RETURN_FLOAT8(default_range_selectivity(operator));
+		}
+	}
+
+	typcache = range_get_typcache(fcinfo, vardata.vartype);
+
+	/*
+	 * OK, there's a Var and a Const we're dealing with here.  We need the
+	 * Const to be of same range type as column, else we can't do anything
+	 * useful. (Such cases will likely fail at runtime, but here we'd rather
+	 * just return a default estimate.)
+	 *
+	 * If the operator is "range @> element", the constant should be of the
+	 * element type of the range column. Convert it to a range that includes
+	 * only that single point, so that we don't need special handling for
+	 * that in what follows.
+	 */
+	if (operator == OID_RANGE_CONTAINS_ELEM_OP)
+	{
+		if (((Const *) other)->consttype == typcache->rngelemtype->type_id)
+		{
+			/* Transform "col @> const_elem" case to "col @> const_range" case */
+			RangeBound lower, upper;
+			lower.inclusive = true;
+			lower.val = ((Const *) other)->constvalue;
+			lower.infinite = false;
+			lower.lower = true;
+			upper.inclusive = true;
+			upper.val = ((Const *) other)->constvalue;
+			upper.infinite = false;
+			upper.lower = false;
+			constval = range_serialize(typcache, &lower, &upper, false);
+		}
+	}
+	else
+	{
+		if (((Const *) other)->consttype == vardata.vartype)
+			constval = DatumGetRangeType(((Const *) other)->constvalue);
+	}
+
+	if (constval)
+		selec = calc_rangesel(typcache, &vardata, constval, operator);
+	else
+		selec = default_range_selectivity(operator);
+
+	ReleaseVariableStats(vardata);
+
+	CLAMP_PROBABILITY(selec);
+
+	PG_RETURN_FLOAT8((float8) selec);
+}
+
+/*
+ * calc_rangesel -- calculate restriction selectivity for range operators.
+ */
+static double
+calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata,
+			  RangeType *constval, Oid operator)
+{
+	double		hist_selec;
+	double		selec;
+	float4	   *numbers, empty_frac, null_frac;
+	int			nnumbers;
+
+	if (HeapTupleIsValid(vardata->statsTuple))
+	{
+		Form_pg_statistic stats;
+		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
+		null_frac = stats->stanullfrac;
+
+		/* Try to get fraction of empty ranges */
+		if (get_attstatsslot(vardata->statsTuple,
+							 vardata->atttype, vardata->atttypmod,
+							 STATISTIC_KIND_RANGE_EMPTY_FRAC, InvalidOid,
+							 NULL,
+							 NULL, NULL,
+							 &numbers, &nnumbers))
+		{
+			if (nnumbers != 1)
+				elog(ERROR, "invalid empty fraction statistic"); /* shouldn't happen */
+			empty_frac = numbers[0];
+		}
+		else
+		{
+			/* No empty fraction statistic. Assume no empty ranges. */
+			empty_frac = 0.0;
+		}
+	}
+	else
+	{
+		/*
+		 * No stats are available. Follow through the calculations below
+		 * anyway, assuming no NULLs and no empty ranges. This still allows
+		 * us to give a better-than-nothing estimate based on whether the
+		 * constant is an empty range or not.
+		 */
+		null_frac = 0.0;
+		empty_frac = 0.0;
+	}
+
+	if (RangeIsEmpty(constval))
+	{
+		/*
+		 * An empty range matches everything, only other empty ranges, or
+		 * nothing, depending on the operator
+		 */
+		switch (operator)
+		{
+			case OID_RANGE_OVERLAP_OP:
+			case OID_RANGE_OVERLAPS_LEFT_OP:
+			case OID_RANGE_OVERLAPS_RIGHT_OP:
+			case OID_RANGE_LEFT_OP:
+			case OID_RANGE_RIGHT_OP:
+				/* these return false if either argument is empty */
+				selec = 0.0;
+				break;
+
+			case OID_RANGE_CONTAINED_OP:
+			case OID_RANGE_LESS_EQUAL_OP:
+			case OID_RANGE_GREATER_EQUAL_OP:
+				/*
+				 * These return true when both args are empty, false if only
+				 * one is empty.
+				 */
+				selec = empty_frac;
+				break;
+
+			case OID_RANGE_CONTAINS_OP:
+				/* everything contains an empty range */
+				selec = 1.0;
+				break;
+
+			case OID_RANGE_CONTAINS_ELEM_OP:
+			default:
+				elog(ERROR, "unexpected operator %u", operator);
+				selec = 0.0; /* keep compiler quiet */
+				break;
+		}
+	}
+	else
+	{
+		/*
+		 * Calculate selectivity using bound histograms.
+		 */
+		hist_selec = calc_hist_selectivity(typcache, vardata, constval, operator);
+		if (hist_selec < 0.0)
+			hist_selec = default_range_selectivity(operator);
+
+		/*
+		 * Now merge the results for the empty ranges and histogram
+		 * calculations, realizing that the histogram covers only the
+		 * non-null, non-empty values.
+		 */
+		if (operator == OID_RANGE_CONTAINED_OP)
+		{
+			/* empty is contained by anything non-empty */
+			selec = (1.0 - empty_frac) * hist_selec + empty_frac;
+		}
+		else
+		{
+			/*
+			 * With any other operator, empty Op non-empty matches nothing.
+			 */
+			selec = (1.0 - empty_frac) * hist_selec;
+		}
+	}
+
+	/* all range operators are strict */
+	selec *= (1.0 - null_frac);
+
+	/* result should be in range, but make sure... */
+	CLAMP_PROBABILITY(selec);
+
+	return selec;
+}
+
+/*
+ * Binary search on an array of range bounds. Returns the greatest index of a
+ * range bound in the array which is less than (or less than or equal to, if
+ * "equal" is true) the given range bound; returns -1 if there is none.
+ */
+static int
+range_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
+			  int hist_length, bool equal)
+{
+	int			lower = -1,
+				upper = hist_length - 1,
+				cmp,
+				middle;
+
+	while (lower < upper)
+	{
+		middle = (lower + upper + 1) / 2;
+		cmp = range_cmp_bounds(typcache, &hist[middle], value);
+
+		if (cmp < 0 || (equal && cmp == 0))
+			lower = middle;
+		else
+			upper = middle - 1;
+	}
+	return lower;
+}
+
+/*
+ * Calculate selectivity using histograms of range bounds
+ */
+static double
+calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
+					  RangeType *constval,Oid operator)
+{
+	Datum	   *hist_values;
+	int			nhist;
+	RangeBound *hist_lower;
+	RangeBound *hist_upper;
+	int			i;
+	RangeBound	const_lower;
+	RangeBound	const_upper;
+	bool		empty;
+	double		hist_selec;
+
+	/* Try to get histogram of ranges */
+	if (!(HeapTupleIsValid(vardata->statsTuple) &&
+		  get_attstatsslot(vardata->statsTuple,
+						   vardata->atttype, vardata->atttypmod,
+						   STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+						   NULL,
+						   &hist_values, &nhist,
+						   NULL, NULL)))
+		return -1.0;
+
+	/*
+	 * Convert the histogram of ranges into histograms of its lower and
+	 * upper bounds.
+	 */
+	hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
+	hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
+	for (i = 0; i < nhist; i++)
+	{
+		range_deserialize(typcache, DatumGetRangeType(hist_values[i]),
+						  &hist_lower[i], &hist_upper[i], &empty);
+		/* The histogram should not contain any empty ranges */
+		if (empty)
+			elog(ERROR, "bounds histogram contains an empty range");
+	}
+
+	/* Extract the bounds of the constant value. */
+	range_deserialize(typcache, constval, &const_lower, &const_upper, &empty);
+	Assert (!empty);
+
+	/* Calculate selectivity of operator using corresponding function */
+	switch (operator)
+	{
+		case OID_RANGE_LESS_OP:
+			hist_selec =
+				calc_hist_selectivity_scalar(typcache, &const_lower,
+											 hist_lower, nhist, false);
+			break;
+
+		case OID_RANGE_LESS_EQUAL_OP:
+			/*
+			 * Estimate var <= const by comparing the lower bounds. The upper
+			 * bound would matter only for rows whose lower bound equals the
+			 * constant, but presumably there aren't many of those.
+			 */
+			hist_selec =
+				calc_hist_selectivity_scalar(typcache, &const_lower,
+											 hist_lower, nhist, true);
+			break;
+
+		case OID_RANGE_GREATER_OP:
+			/* Like for <, compare just the lower bounds. */
+			hist_selec =
+				1 - calc_hist_selectivity_scalar(typcache, &const_lower,
+												 hist_lower, nhist, true);
+			break;
+
+		case OID_RANGE_GREATER_EQUAL_OP:
+			hist_selec =
+				1 - calc_hist_selectivity_scalar(typcache, &const_lower,
+												 hist_lower, nhist, false);
+			break;
+
+		case OID_RANGE_LEFT_OP:
+			/* var << const if the upper bound of var < lower bound of const */
+			hist_selec =
+				calc_hist_selectivity_scalar(typcache, &const_lower,
+											 hist_upper, nhist, false);
+			break;
+
+		case OID_RANGE_RIGHT_OP:
+			/* var >> const if the lower bound of var > upper bound of const */
+			hist_selec =
+				1 - calc_hist_selectivity_scalar(typcache, &const_upper,
+												 hist_lower, nhist, true);
+			break;
+
+		case OID_RANGE_OVERLAPS_RIGHT_OP:
+			/* compare lower bounds */
+			hist_selec =
+				1 - calc_hist_selectivity_scalar(typcache, &const_lower,
+												 hist_lower, nhist, false);
+			break;
+
+		case OID_RANGE_OVERLAPS_LEFT_OP:
+			/* compare upper bounds */
+			hist_selec =
+				calc_hist_selectivity_scalar(typcache, &const_upper,
+											 hist_upper, nhist, true);
+			break;
+
+		case OID_RANGE_OVERLAP_OP:
+		case OID_RANGE_CONTAINS_ELEM_OP:
+			/*
+			 * A && B <=> NOT (A << B OR A >> B).
+			 *
+			 * "range @> elem" is equivalent to "range && [elem,elem]". The
+			 * caller already constructed the singular range from the element
+			 * constant, so just treat it the same as &&.
+			 */
+			hist_selec =
+				calc_hist_selectivity_scalar(typcache, &const_lower, hist_upper,
+											 nhist, false);
+			hist_selec +=
+				(1.0 - calc_hist_selectivity_scalar(typcache, &const_upper, hist_lower,
+												  nhist, true));
+			hist_selec = 1.0 - hist_selec;
+			break;
+
+		case OID_RANGE_CONTAINS_OP:
+		case OID_RANGE_CONTAINED_OP:
+			/* TODO: not implemented yet */
+			hist_selec = -1.0;
+			break;
+
+		default:
+			elog(ERROR, "unknown range operator %u", operator);
+			hist_selec = -1.0; /* keep compiler quiet */
+			break;
+	}
+
+	return hist_selec;
+}
+
+/*
+ * Get relative position of value in histogram bin in [0,1] range.
+ */
+static float8
+get_position(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist1,
+			 RangeBound *hist2)
+{
+	float8		position;
+
+	if (!hist1->infinite && !hist2->infinite)
+	{
+		float8		bin_width;
+
+		/*
+		 * Neither bound of the histogram bin is infinite. Since the value
+		 * should fall within this bin, it shouldn't be infinite either.
+		 */
+		Assert(!value->infinite);
+
+		/* Calculate relative position using subdiff function. */
+		bin_width = DatumGetFloat8(FunctionCall2Coll(
+												&typcache->rng_subdiff_finfo,
+													 typcache->rng_collation,
+													 hist2->val,
+													 hist1->val));
+		if (bin_width <= 0.0)
+		{
+			/* Assume the value lies at the center of the zero-width bin */
+			return 0.5;
+		}
+		position = DatumGetFloat8(FunctionCall2Coll(
+												&typcache->rng_subdiff_finfo,
+													typcache->rng_collation,
+													value->val,
+													hist1->val))
+			/ bin_width;
+
+		/* Relative position must be in [0,1] range */
+		position = Max(position, 0.0);
+		position = Min(position, 1.0);
+		return position;
+	}
+	else if (!hist2->infinite)
+	{
+		/*
+		 * One bound of the histogram bin is infinite and the other is not.
+		 * Return 0 or 1 depending on whether the value is infinite.
+		 */
+		if (value->infinite)
+			return 0.0;
+		else
+			return 1.0;
+	}
+	else
+	{
+		/* All bounds are infinite; assume the value sits at position 0.5 */
+		Assert(hist1->infinite && hist2->infinite && value->infinite);
+		return 0.5;
+	}
+}
+
+/*
+ * Estimate the fraction of histogram entries below (or not above) the bound.
+ */
+static double
+calc_hist_selectivity_scalar(TypeCacheEntry *typcache, RangeBound *constbound,
+							 RangeBound *hist, int hist_nvalues, bool equal)
+{
+	Selectivity	selec;
+	int			index;
+
+	/* Estimate selectivity as the fraction of bins below the bound */
+	index = range_bsearch(typcache, constbound, hist, hist_nvalues, equal);
+	selec = (Selectivity) (Max(index, 0)) / (Selectivity) (hist_nvalues - 1);
+
+	/* Adjust selectivity using linear interpolation */
+	if (index >= 0 && index < hist_nvalues - 1)
+		selec += get_position(typcache, constbound, &hist[index],
+						&hist[index + 1]) / (Selectivity) (hist_nvalues - 1);
+
+	return selec;
+}
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
new file mode 100644
index 0000000..ecf6504
--- /dev/null
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -0,0 +1,234 @@
+/*-------------------------------------------------------------------------
+ *
+ * rangetypes_typanalyze.c
+ *	  Functions for gathering statistics from range columns
+ *
+ * For a range type column, histograms of lower and upper bounds, and
+ * the fraction of NULL and empty ranges are collected.
+ *
+ * Both histograms have the same length, and they are combined into a
+ * single array of ranges. This has the same shape as the histogram that
+ * std_typanalyze would collect, but the values are different. Each range
+ * in the array is a valid range, even though the lower and upper bounds
+ * come from different tuples (for each individual value in the table,
+ * lower <= upper bound, and that's enough to guarantee that the same
+ * applies to the percentiles of lower and upper bounds). In theory, the
+ * standard scalar selectivity functions could be used with that histogram.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/rangetypes_typanalyze.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_operator.h"
+#include "commands/vacuum.h"
+#include "utils/builtins.h"
+#include "utils/rangetypes.h"
+
+static void compute_range_stats(VacAttrStats *stats,
+		   AnalyzeAttrFetchFunc fetchfunc, int samplerows, double totalrows);
+
+/*
+ * range_typanalyze -- typanalyze function for range columns
+ */
+Datum
+range_typanalyze(PG_FUNCTION_ARGS)
+{
+	VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);
+	TypeCacheEntry *typcache;
+	Form_pg_attribute attr = stats->attr;
+
+	/* Get information about range type */
+	typcache = range_get_typcache(fcinfo, stats->attrtypid);
+
+	if (attr->attstattarget < 0)
+        attr->attstattarget = default_statistics_target;
+
+	stats->compute_stats = compute_range_stats;
+	stats->extra_data = typcache;
+	/* Same as in std_typanalyze */
+	stats->minrows = 300 * attr->attstattarget;
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * Comparison function for sorting RangeBounds.
+ */
+static int
+range_bound_qsort_cmp(const void *a1, const void *a2, void *arg)
+{
+	RangeBound *b1 = (RangeBound *)a1;
+	RangeBound *b2 = (RangeBound *)a2;
+	TypeCacheEntry *typcache = (TypeCacheEntry *)arg;
+
+	return range_cmp_bounds(typcache, b1, b2);
+}
+
+/*
+ * compute_range_stats() -- compute statistics for range column
+ */
+static void
+compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
+					int samplerows, double totalrows)
+{
+	TypeCacheEntry *typcache = (TypeCacheEntry *) stats->extra_data;
+	int			null_cnt = 0;
+	int			non_null_cnt = 0;
+	int			non_empty_cnt = 0;
+	int			empty_cnt = 0;
+	int			range_no;
+	int			slot_idx;
+	int			num_bins = stats->attr->attstattarget;
+	int			num_hist;
+	RangeBound *lowers, *uppers;
+	double		total_width = 0;
+
+	/* Allocate memory for arrays of range bounds. */
+	lowers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
+	uppers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
+
+	/* Loop over the sample ranges. */
+	for (range_no = 0; range_no < samplerows; range_no++)
+	{
+		Datum		value;
+		bool		isnull,
+					empty;
+		RangeType  *range;
+		RangeBound	lower,
+					upper;
+
+		vacuum_delay_point();
+
+		value = fetchfunc(stats, range_no, &isnull);
+		if (isnull)
+		{
+			/* range is null, just count that */
+			null_cnt++;
+			continue;
+		}
+
+		/*
+		 * XXX: should we ignore wide values, like std_typanalyze does, to
+		 * avoid bloating the statistics table?
+		 */
+		total_width += VARSIZE_ANY(DatumGetPointer(value));
+
+		/* Get range and deserialize it for further analysis. */
+		range = DatumGetRangeType(value);
+		range_deserialize(typcache, range, &lower, &upper, &empty);
+
+		if (!empty)
+		{
+			/* Fill bound values for further usage in histograms */
+			lowers[non_empty_cnt] = lower;
+			uppers[non_empty_cnt] = upper;
+			non_empty_cnt++;
+		}
+		else
+			empty_cnt++;
+
+		non_null_cnt++;
+	}
+
+	slot_idx = 0;
+
+	/* We can only compute real stats if we found some non-null values. */
+	if (non_null_cnt > 0)
+	{
+		Datum	   *bound_hist_values;
+		int			pos,
+					posfrac,
+					delta,
+					deltafrac,
+					i;
+		MemoryContext old_cxt;
+		float4	   *emptyfrac;
+
+		stats->stats_valid = true;
+		/* Do the simple null-frac and width stats */
+		stats->stanullfrac = (double) null_cnt / (double) samplerows;
+		stats->stawidth = total_width / (double) non_null_cnt;
+		stats->stadistinct = -1.0;
+
+		/* Must copy the target values into anl_context */
+		old_cxt = MemoryContextSwitchTo(stats->anl_context);
+
+		if (non_empty_cnt > 0)
+		{
+			/* Sort bound values */
+			qsort_arg(lowers, non_empty_cnt, sizeof(RangeBound),
+					  range_bound_qsort_cmp, typcache);
+			qsort_arg(uppers, non_empty_cnt, sizeof(RangeBound),
+					  range_bound_qsort_cmp, typcache);
+
+			num_hist = non_empty_cnt;
+			if (num_hist > num_bins)
+				num_hist = num_bins + 1;
+
+			bound_hist_values = (Datum *) palloc(num_hist * sizeof(Datum));
+
+			/*
+			 * The object of this loop is to construct ranges from first and
+			 * last lowers[] and uppers[]  entries along with evenly-spaced
+			 * values in between. So the i'th value is range of
+			 * lowers[(i * (nvals - 1)) / (num_hist - 1)] and
+			 * uppers[(i * (nvals - 1)) / (num_hist - 1)]. But computing that
+			 * subscript directly risks integer overflow when the stats target
+			 * is more than a couple thousand.  Instead we add
+			 * (nvals - 1) / (num_hist - 1) to pos at each step, tracking the
+			 * integral and fractional parts of the sum separately.
+			 */
+			delta = (non_empty_cnt - 1) / (num_hist - 1);
+			deltafrac = (non_empty_cnt - 1) % (num_hist - 1);
+			pos = posfrac = 0;
+
+			for (i = 0; i < num_hist; i++)
+			{
+				bound_hist_values[i] = PointerGetDatum(range_serialize(
+								typcache, &lowers[pos], &uppers[pos], false));
+				pos += delta;
+				posfrac += deltafrac;
+				if (posfrac >= (num_hist - 1))
+				{
+					/* fractional part exceeds 1, carry to integer part */
+					pos++;
+					posfrac -= (num_hist - 1);
+				}
+			}
+
+			stats->stakind[slot_idx] = STATISTIC_KIND_BOUNDS_HISTOGRAM;
+			stats->stavalues[slot_idx] = bound_hist_values;
+			stats->numvalues[slot_idx] = num_hist;
+			slot_idx++;
+		}
+
+		/* Store the fraction of empty ranges */
+		emptyfrac = (float4 *) palloc(sizeof(float4));
+		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
+		stats->stakind[slot_idx] = STATISTIC_KIND_RANGE_EMPTY_FRAC;
+		stats->stanumbers[slot_idx] = emptyfrac;
+		stats->numnumbers[slot_idx] = 1;
+		slot_idx++;
+
+		MemoryContextSwitchTo(old_cxt);
+	}
+	else if (null_cnt > 0)
+	{
+		/* We found only nulls; assume the column is entirely null */
+		stats->stats_valid = true;
+		stats->stanullfrac = 1.0;
+		stats->stawidth = 0;		/* "unknown" */
+		stats->stadistinct = 0.0;	/* "unknown" */
+	}
+	/*
+	 * We don't need to bother cleaning up any of our temporary palloc's;
+	 * they will be freed when the analyze memory context is reset.
+	 */
+}
diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h
index 9470254..abfec5c 100644
--- a/src/include/catalog/pg_operator.h
+++ b/src/include/catalog/pg_operator.h
@@ -1676,32 +1676,45 @@ DATA(insert OID = 3882 (  "="	   PGNSP PGUID b t t 3831 3831 16 3882 3883 range_
 DESCR("equal");
 DATA(insert OID = 3883 (  "<>"	   PGNSP PGUID b f f 3831 3831 16 3883 3882 range_ne neqsel neqjoinsel ));
 DESCR("not equal");
-DATA(insert OID = 3884 (  "<"	   PGNSP PGUID b f f 3831 3831 16 3887 3886 range_lt scalarltsel scalarltjoinsel ));
+DATA(insert OID = 3884 (  "<"	   PGNSP PGUID b f f 3831 3831 16 3887 3886 range_lt rangesel scalarltjoinsel ));
 DESCR("less than");
-DATA(insert OID = 3885 (  "<="	   PGNSP PGUID b f f 3831 3831 16 3886 3887 range_le scalarltsel scalarltjoinsel ));
+#define OID_RANGE_LESS_OP 3884
+DATA(insert OID = 3885 (  "<="	   PGNSP PGUID b f f 3831 3831 16 3886 3887 range_le rangesel scalarltjoinsel ));
 DESCR("less than or equal");
-DATA(insert OID = 3886 (  ">="	   PGNSP PGUID b f f 3831 3831 16 3885 3884 range_ge scalargtsel scalargtjoinsel ));
+#define OID_RANGE_LESS_EQUAL_OP 3885
+DATA(insert OID = 3886 (  ">="	   PGNSP PGUID b f f 3831 3831 16 3885 3884 range_ge rangesel scalargtjoinsel ));
 DESCR("greater than or equal");
-DATA(insert OID = 3887 (  ">"	   PGNSP PGUID b f f 3831 3831 16 3884 3885 range_gt scalargtsel scalargtjoinsel ));
+#define OID_RANGE_GREATER_OP 3886
+DATA(insert OID = 3887 (  ">"	   PGNSP PGUID b f f 3831 3831 16 3884 3885 range_gt rangesel scalargtjoinsel ));
 DESCR("greater than");
-DATA(insert OID = 3888 (  "&&"	   PGNSP PGUID b f f 3831 3831 16 3888 0 range_overlaps areasel areajoinsel ));
+#define OID_RANGE_GREATER_EQUAL_OP 3887
+DATA(insert OID = 3888 (  "&&"	   PGNSP PGUID b f f 3831 3831 16 3888 0 range_overlaps rangesel areajoinsel ));
 DESCR("overlaps");
-DATA(insert OID = 3889 (  "@>"	   PGNSP PGUID b f f 3831 2283 16 3891 0 range_contains_elem contsel contjoinsel ));
+#define OID_RANGE_OVERLAP_OP 3888
+DATA(insert OID = 3889 (  "@>"	   PGNSP PGUID b f f 3831 2283 16 3891 0 range_contains_elem rangesel contjoinsel ));
 DESCR("contains");
-DATA(insert OID = 3890 (  "@>"	   PGNSP PGUID b f f 3831 3831 16 3892 0 range_contains contsel contjoinsel ));
+#define OID_RANGE_CONTAINS_ELEM_OP 3889
+DATA(insert OID = 3890 (  "@>"	   PGNSP PGUID b f f 3831 3831 16 3892 0 range_contains rangesel contjoinsel ));
 DESCR("contains");
-DATA(insert OID = 3891 (  "<@"	   PGNSP PGUID b f f 2283 3831 16 3889 0 elem_contained_by_range contsel contjoinsel ));
+#define OID_RANGE_CONTAINS_OP 3890
+DATA(insert OID = 3891 (  "<@"	   PGNSP PGUID b f f 2283 3831 16 3889 0 elem_contained_by_range rangesel contjoinsel ));
 DESCR("is contained by");
-DATA(insert OID = 3892 (  "<@"	   PGNSP PGUID b f f 3831 3831 16 3890 0 range_contained_by contsel contjoinsel ));
+#define OID_RANGE_ELEM_CONTAINED_OP 3891
+DATA(insert OID = 3892 (  "<@"	   PGNSP PGUID b f f 3831 3831 16 3890 0 range_contained_by rangesel contjoinsel ));
 DESCR("is contained by");
-DATA(insert OID = 3893 (  "<<"	   PGNSP PGUID b f f 3831 3831 16 3894 0 range_before scalarltsel scalarltjoinsel ));
+#define OID_RANGE_CONTAINED_OP 3892
+DATA(insert OID = 3893 (  "<<"	   PGNSP PGUID b f f 3831 3831 16 3894 0 range_before rangesel scalarltjoinsel ));
 DESCR("is left of");
-DATA(insert OID = 3894 (  ">>"	   PGNSP PGUID b f f 3831 3831 16 3893 0 range_after scalargtsel scalargtjoinsel ));
+#define OID_RANGE_LEFT_OP 3893
+DATA(insert OID = 3894 (  ">>"	   PGNSP PGUID b f f 3831 3831 16 3893 0 range_after rangesel scalargtjoinsel ));
 DESCR("is right of");
-DATA(insert OID = 3895 (  "&<"	   PGNSP PGUID b f f 3831 3831 16 0 0 range_overleft scalarltsel scalarltjoinsel ));
+#define OID_RANGE_RIGHT_OP 3894
+DATA(insert OID = 3895 (  "&<"	   PGNSP PGUID b f f 3831 3831 16 0 0 range_overleft rangesel scalarltjoinsel ));
 DESCR("overlaps or is left of");
-DATA(insert OID = 3896 (  "&>"	   PGNSP PGUID b f f 3831 3831 16 0 0 range_overright scalargtsel scalargtjoinsel ));
+#define OID_RANGE_OVERLAPS_LEFT_OP 3895
+DATA(insert OID = 3896 (  "&>"	   PGNSP PGUID b f f 3831 3831 16 0 0 range_overright rangesel scalargtjoinsel ));
 DESCR("overlaps or is right of");
+#define OID_RANGE_OVERLAPS_RIGHT_OP 3896
 DATA(insert OID = 3897 (  "-|-"    PGNSP PGUID b f f 3831 3831 16 3897 0 range_adjacent contsel contjoinsel ));
 DESCR("is adjacent to");
 DATA(insert OID = 3898 (  "+"	   PGNSP PGUID b f f 3831 3831 3831 3898 0 range_union - - ));
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 51449a4..77a3b41 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4544,6 +4544,8 @@ DATA(insert OID = 3902 (  hash_range			PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0
 DESCR("hash a range");
 DATA(insert OID = 3916 (  range_typanalyze		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ range_typanalyze _null_ _null_ _null_ ));
 DESCR("range typanalyze");
+DATA(insert OID = 3169 (  rangesel				PGNSP PGUID 12 1 0 0 0 f f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null_ _null_ _null_ rangesel _null_ _null_ _null_ ));
+DESCR("restriction selectivity for range operators");
 
 DATA(insert OID = 3914 (  int4range_canonical		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 3904 "3904" _null_ _null_ _null_ _null_ int4range_canonical _null_ _null_ _null_ ));
 DESCR("convert an int4 range to canonical form");
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 8724cb5..e63b5e8 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -268,4 +268,19 @@ typedef FormData_pg_statistic *Form_pg_statistic;
  */
 #define STATISTIC_KIND_DECHIST	5
 
+/*
+ * An "empty frac" slot holds the fraction of empty ranges in a range-type column.
+ * stavalues is not used and should be NULL. stanumbers contains a single
+ * entry, the fraction of empty ranges (0.0 to 1.0).
+ */
+#define STATISTIC_KIND_RANGE_EMPTY_FRAC	6
+
+/*
+ * A histogram similar to STATISTIC_KIND_HISTOGRAM, but the lower and upper
+ * bound values are calculated and sorted separately. In other words, this is
+ * actually two histograms combined into a single array of ranges. Contains
+ * only non-empty ranges.
+ */
+#define STATISTIC_KIND_BOUNDS_HISTOGRAM	7
+
 #endif   /* PG_STATISTIC_H */
diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h
index fb0b23b..a7eeda0 100644
--- a/src/include/utils/rangetypes.h
+++ b/src/include/utils/rangetypes.h
@@ -170,6 +170,7 @@ extern Datum hash_range(PG_FUNCTION_ARGS);
 
 /* ANALYZE support */
 extern Datum range_typanalyze(PG_FUNCTION_ARGS);
+extern Datum rangesel(PG_FUNCTION_ARGS);
 
 /* Canonical functions */
 extern Datum int4range_canonical(PG_FUNCTION_ARGS);
#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#20)
Re: Statistics and selectivity estimation for ranges

On 24.08.2012 18:51, Heikki Linnakangas wrote:

On 20.08.2012 00:31, Alexander Korotkov wrote:

New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.

One worry I have about that format for the histogram is that you
deserialize all the values in the histogram, before you do the binary
searches. That seems expensive if stats target is very high. I guess you
could deserialize them lazily to alleviate that, though.

* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.

Thanks!

I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.

I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.

Please take a look, to see if I messed up something.

Committed this with some further changes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#22Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#21)
Re: Statistics and selectivity estimation for ranges

On Mon, Aug 27, 2012 at 5:00 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 24.08.2012 18:51, Heikki Linnakangas wrote:

On 20.08.2012 00:31, Alexander Korotkov wrote:

New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower
and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.

One worry I have about that format for the histogram is that you
deserialize all the values in the histogram, before you do the binary
searches. That seems expensive if stats target is very high. I guess you
could deserialize them lazily to alleviate that, though.

* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.

Thanks!

I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.

I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.

Please take a look, to see if I messed up something.

Committed this with some further changes.

Thanks! Sorry I didn't provide feedback on the previous message. The
committed patch looks good to me. I'm going to provide an additional patch
with the length histogram and more selectivity estimates.

------
With best regards,
Alexander Korotkov.

#23Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#21)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Mon, Aug 27, 2012 at 5:00 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 24.08.2012 18:51, Heikki Linnakangas wrote:

On 20.08.2012 00:31, Alexander Korotkov wrote:

New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower
and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.

One worry I have about that format for the histogram is that you
deserialize all the values in the histogram, before you do the binary
searches. That seems expensive if stats target is very high. I guess you
could deserialize them lazily to alleviate that, though.

* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.

Thanks!

I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.

I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.

Please take a look, to see if I messed up something.

Committed this with some further changes.

The add-on patch is attached. Actually, I don't quite follow your intention
in introducing the STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
keep the empty fraction in a distinct stakind, or to replace that stakind
with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached patch,
STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced with
STATISTIC_KIND_LENGTH_HISTOGRAM.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-addon-0.1.patch.gzapplication/x-gzip; name=range_stat-addon-0.1.patch.gzDownload
#24Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#23)
Re: Statistics and selectivity estimation for ranges

On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:

The add-on patch is attached. Actually, I don't quite follow your intention
in introducing the STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
keep the empty fraction in a distinct stakind, or to replace that stakind
with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached patch,
STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced with
STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where
there is no subdiff function. Perhaps default to 1.0 or something?
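
(For instance - just a sketch of the idea, not code from any posted patch -
the places that call the subdiff could first test whether it exists:

    if (!OidIsValid(typcache->rng_subdiff_finfo.fn_oid))
        return 1.0;     /* no subdiff function; fall back to a default */
)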

2. I think it would be helpful to add comments regarding what happens
when lengths are identical, right now it's a little confusing. For
instance, the comment: "Generate a length histogram slot entry if there
are at least two length values" doesn't seem right, because the
condition still matches even if there is only one distinct value.

3. It looks like get_distance also needs to guard against a missing
subdiff.

4. There are 3 binary search functions, which seems a little excessive:
* rbound_bsearch: greatest i such that hist[i] < v; or -1
* rbound_bsearch_equal: greatest i such that:
hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
* length_hist_bsearch: least i such that hist[i] >= v;
or length of hist
(let me know if I misunderstood the definitions)
At a minimum, we need more consistent and informative names. Also, the
definition of rbound_bsearch_equal is a little confusing because it's
looking for the highest index among distinct values, but the lowest
index among identical values. Do you see a way to refactor these to be a
little easier to understand?
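
Just to illustrate the difference under the definitions above (a made-up
example, and only meaningful if I've restated the definitions correctly):
with hist = {10, 20, 20, 30} and v = 20, rbound_bsearch returns 0 (hist[0] =
10 is the last value strictly below v), length_hist_bsearch returns 1
(hist[1] = 20 is the first value >= v), and rbound_bsearch_equal also
returns 1 - the first index of the run of values equal to v, which is the
greatest i with hist[i] <= v and hist[i-1] != hist[i].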

Regards,
Jeff Davis

#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#24)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Mon, Oct 1, 2012 at 3:22 AM, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:

The add-on patch is attached. Actually, I don't quite follow your intention
in introducing the STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
keep the empty fraction in a distinct stakind, or to replace that stakind
with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached patch,
STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced with
STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where
there is no subdiff function. Perhaps default to 1.0 or something?

Let it default to 1.0 when there is no subdiff function. There is not much
use for this estimation method without subdiff, but it's probably better
than nothing.

2. I think it would be helpful to add comments regarding what happens
when lengths are identical, right now it's a little confusing. For
instance, the comment: "Generate a length histogram slot entry if there
are at least two length values" doesn't seem right, because the
condition still matches even if there is only one distinct value.

I've rephrased the comment. Now it implicitly says that the collected values
are not necessarily distinct.

3. It looks like get_distance also needs to guard against a missing
subdiff.

Same as in compute_range_stats: let the default value be 1.0.

4. There are 3 binary search functions, which seems a little excessive:
* rbound_bsearch: greatest i such that hist[i] < v; or -1
* rbound_bsearch_equal: greatest i such that:
hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
* length_hist_bsearch: least i such that hist[i] >= v;
or length of hist
(let me know if I misunderstood the definitions)
At a minimum, we need more consistent and informative names. Also, the
definition of rbound_bsearch_equal is a little confusing because it's
looking for the highest index among distinct values, but the lowest
index among identical values. Do you see a way to refactor these to be a
little easier to understand?

Actually, the goal of rbound_bsearch_equal is to find the histogram bin to
start interpolation from. I've renamed it to rbound_bsearch_bin and added a
corresponding comment.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.8.patch.gzapplication/x-gzip; name=range_stat-0.8.patch.gzDownload
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Korotkov (#25)
Re: Statistics and selectivity estimation for ranges

Heikki, would you be able to give this patch a look and perhaps commit
it?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Korotkov (#25)
Re: Statistics and selectivity estimation for ranges

What's going on with this patch? I haven't seen any activity in a
while. Should I just move this to the next commitfest?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28Jeff Davis
pgsql@j-davis.com
In reply to: Alvaro Herrera (#27)
Re: Statistics and selectivity estimation for ranges

On Mon, 2012-11-05 at 11:12 -0300, Alvaro Herrera wrote:

What's going on with this patch? I haven't seen any activity in a
while. Should I just move this to the next commitfest?

Sorry, I dropped the ball here. I will still review it, whether it makes
this commitfest or not.

Regards,
Jeff Davis

#29Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Davis (#28)
Re: Statistics and selectivity estimation for ranges

On 2012-11-05 22:10:50 -0800, Jeff Davis wrote:

On Mon, 2012-11-05 at 11:12 -0300, Alvaro Herrera wrote:

What's going on with this patch? I haven't seen any activity in a
while. Should I just move this to the next commitfest?

Sorry, I dropped the ball here. I will still review it, whether it makes
this commitfest or not.

Sorry to nag, but it's starting to look like it might fall off the end of the
next CF...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#30Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#25)
Re: Statistics and selectivity estimation for ranges

It looks like there are still some problems with this patch.

CREATE TABLE foo(ir int4range);
insert into foo select 'empty' from generate_series(1,10000);
insert into foo select int4range(NULL, g, '(]')
from generate_series(1,1000000) g;
insert into foo select int4range(g, NULL, '[)')
from generate_series(1,1000000) g;
insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]')
from generate_series(1,1000000) g;
CREATE TABLE bar(ir) AS select * from foo order by random();
ANALYZE bar;

Now:
EXPLAIN ANALYZE SELECT * FROM bar
WHERE ir @> int4range(10000,20000);

The estimates are "-nan". Similar for many other queries.

And I have a few other questions/comments:

* Why is "summ" spelled with two "m"s? Is it short for "summation"? If
so, might be good to use "summation of" instead of "integrate" in the
comment.

* Why does get_length_hist_frac return 0.0 when i is the last value? Is
that a mistake?

* I am still confused by the distinction between rbound_bsearch and
rbound_bsearch_bin. What is the intuitive purpose of each?

* You use "constant value" in the comments in several places. Would
"query value" or "search key" be better?

Regards,
Jeff Davis

#31Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#30)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

Hi, Jeff!

Thanks a lot for the review!

On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:

It looks like there are still some problems with this patch.

CREATE TABLE foo(ir int4range);
insert into foo select 'empty' from generate_series(1,10000);
insert into foo select int4range(NULL, g, '(]')
from generate_series(1,1000000) g;
insert into foo select int4range(g, NULL, '[)')
from generate_series(1,1000000) g;
insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]')
from generate_series(1,1000000) g;
CREATE TABLE bar(ir) AS select * from foo order by random();
ANALYZE bar;

Now:
EXPLAIN ANALYZE SELECT * FROM bar
WHERE ir @> int4range(10000,20000);

The estimates are "-nan". Similar for many other queries.

Oh, yeah! It appears that infinities require much more cautious handling
than I supposed. That should be fixed in the attached version of the patch.
However, it requires significant rethinking of the comments. I'll update the
comments and address your questions in a couple of days. Could you recheck
whether the attached patch really fixes the problem you reported?
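
As a guess at the mechanism (my own illustration, not taken from either
patch): once the "length" of a range with an infinite bound is represented
as IEEE infinity, the histogram interpolation arithmetic can end up
computing inf - inf, which is NaN, and the NaN then propagates all the way
into the row estimate. A tiny standalone C example of the arithmetic:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		inf = INFINITY;
	double		bin_width = inf - inf;			/* NaN */
	double		frac = (inf - inf) / bin_width;	/* still NaN */

	printf("%f %f\n", bin_width, frac);			/* prints nan (or -nan) */
	return 0;
}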

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.9.patch.gzapplication/x-gzip; name=range_stat-0.9.patch.gzDownload
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#30)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:

And I have a few other questions/comments:

* Why is "summ" spelled with two "m"s? Is it short for "summation"? If
so, might be good to use "summation of" instead of "integrate" in the
comment.

Fixed.

* Why does get_length_hist_frac return 0.0 when i is the last value? Is
that a mistake?

The comment was wrong. Actually it returns the fraction of ranges whose
length is *greater*.

* I am still confused by the distinction between rbound_bsearch and
rbound_bsearch_bin. What is the intuitive purpose of each?

I've added corresponding comments. rbound_bsearch is for the scalar operators
and for finding the bin corresponding to an upper bound. rbound_bsearch_bin
is now rbound_bsearch_bin_lower; it is for finding the bin corresponding to a
lower bound.

* You use "constant value" in the comments in several places. Would
"query value" or "search key" be better?

Yes. Fixed.

I also renamed get_length_hist_frac to get_length_hist_summ and rewrote the
comments about it. I hope it is more understandable now.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.10.patch.gzapplication/x-gzip; name=range_stat-0.10.patch.gzDownload
#33Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alexander Korotkov (#32)
Re: Statistics and selectivity estimation for ranges

On 04.01.2013 10:42, Alexander Korotkov wrote:

/*
* Calculate selectivity of "&&" operator using histograms of range lower bounds
* and histogram of range lengths.
*/
static double
calc_hist_selectivity_overlap(TypeCacheEntry *typcache, RangeBound *lower,
RangeBound *upper, RangeBound *hist_lower, int hist_nvalues,
Datum *length_hist_values, int length_hist_nvalues)

We already have code to estimate &&, based on the lower and upper bound
histograms:

case OID_RANGE_OVERLAP_OP:
case OID_RANGE_CONTAINS_ELEM_OP:
/*
* A && B <=> NOT (A << B OR A >> B).
*
* "range @> elem" is equivalent to "range && [elem,elem]". The
* caller already constructed the singular range from the element
* constant, so just treat it the same as &&.
*/
hist_selec =
calc_hist_selectivity_scalar(typcache, &const_lower, hist_upper,
nhist, false);
hist_selec +=
(1.0 - calc_hist_selectivity_scalar(typcache, &const_upper, hist_lower,
nhist, true));
hist_selec = 1.0 - hist_selec;
break;

I don't think the method based on lower bound and length histograms is
any better. In fact, my gut feeling is that it's less accurate. I'd
suggest dropping that part of the patch.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#33)
Re: Statistics and selectivity estimation for ranges

On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 04.01.2013 10:42, Alexander Korotkov wrote:

/*
* Calculate selectivity of "&&" operator using histograms of range lower
bounds
* and histogram of range lengths.
*/
static double
calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
*lower,
RangeBound *upper, RangeBound
*hist_lower, int hist_nvalues,
Datum
*length_hist_values, int length_hist_nvalues)

We already have code to estimate &&, based on the lower and upper bound
histograms:

case OID_RANGE_OVERLAP_OP:

case OID_RANGE_CONTAINS_ELEM_OP:
/*
* A && B <=> NOT (A << B OR A >> B).
*
* "range @> elem" is equivalent to "range &&
[elem,elem]". The
* caller already constructed the singular range
from the element
* constant, so just treat it the same as &&.
*/
hist_selec =
calc_hist_selectivity_scalar(**typcache,
&const_lower, hist_upper,

nhist, false);
hist_selec +=
(1.0 - calc_hist_selectivity_scalar(**typcache,
&const_upper, hist_lower,

nhist, true));
hist_selec = 1.0 - hist_selec;
break;

I don't think the method based on lower bound and length histograms is any
better. In fact, my gut feeling is that it's less accurate. I'd suggest
dropping that part of the patch.

Right. This estimation has the accuracy of the histogram, while the estimation
based on lower bound and length histograms relies on an additional assumption
about independence of the lower bound and the length. We can sum the A << B and
A >> B probabilities because they are mutually exclusive. It's pretty evident,
but I would like to mention it in the comments, because the typical assumption
about events in statistics calculations is their independence.
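
For illustration, here is a minimal standalone C sketch of that estimate,
assuming float8 bounds and equi-depth histograms; hist_frac_le() is only an
illustrative stand-in for calc_hist_selectivity_scalar(), and the numbers are
made up:

#include <stdio.h>

/* Fraction of values <= v, from an equi-depth histogram with n boundaries. */
static double
hist_frac_le(const double *hist, int n, double v)
{
	int		i;

	if (v < hist[0])
		return 0.0;
	if (v >= hist[n - 1])
		return 1.0;
	for (i = 0; i < n - 1; i++)
	{
		if (v < hist[i + 1])
		{
			double	pos = (v - hist[i]) / (hist[i + 1] - hist[i]);

			return (i + pos) / (double) (n - 1);
		}
	}
	return 1.0;
}

/* P(A && B) = 1 - P(A << B) - P(A >> B); the two events are mutually exclusive. */
static double
overlap_selec(const double *hist_lower, const double *hist_upper, int n,
			  double const_lower, double const_upper)
{
	double	left_of = hist_frac_le(hist_upper, n, const_lower);			/* A << B */
	double	right_of = 1.0 - hist_frac_le(hist_lower, n, const_upper);	/* A >> B */

	return 1.0 - (left_of + right_of);
}

int
main(void)
{
	/* illustrative lower- and upper-bound histograms */
	double	hist_lower[] = {10, 20, 40, 100, 120};
	double	hist_upper[] = {15, 30, 70, 110, 140};

	printf("estimated && selectivity: %.3f\n",
		   overlap_selec(hist_lower, hist_upper, 5, 25, 60));
	return 0;
}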

------
With best regards,
Alexander Korotkov.

#35Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#34)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On Wed, Feb 13, 2013 at 5:55 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 04.01.2013 10:42, Alexander Korotkov wrote:

/*
* Calculate selectivity of "&&" operator using histograms of range
lower bounds
* and histogram of range lengths.
*/
static double
calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
*lower,
RangeBound *upper, RangeBound
*hist_lower, int hist_nvalues,
Datum
*length_hist_values, int length_hist_nvalues)

We already have code to estimate &&, based on the lower and upper bound
histograms:

case OID_RANGE_OVERLAP_OP:

case OID_RANGE_CONTAINS_ELEM_OP:
/*
* A && B <=> NOT (A << B OR A >> B).
*
* "range @> elem" is equivalent to "range &&
[elem,elem]". The
* caller already constructed the singular range
from the element
* constant, so just treat it the same as &&.
*/
hist_selec =
calc_hist_selectivity_scalar(**typcache,
&const_lower, hist_upper,

nhist, false);
hist_selec +=
(1.0 - calc_hist_selectivity_scalar(**typcache,
&const_upper, hist_lower,

nhist, true));
hist_selec = 1.0 - hist_selec;
break;

I don't think the method based on lower bound and length histograms is
any better. In fact, my gut feeling is that it's less accurate. I'd suggest
dropping that part of the patch.

Right. This estimation has an accuracy of histogram, while estimation
based on lower bound and length histograms rely on additional assumption
about independence of lower bound and length histogram. We can sum A << B
and A >> B probabilities because they are mutually exclusive. It's pretty
evident but I would like to mention it in the comments, because typical
assumption about events in statistics calculation is their independence.

These changes were made in attached patch.

------
With best regards,
Alexander Korotkov.

Attachments:

range_stat-0.11.patch.gzapplication/x-gzip; name=range_stat-0.11.patch.gzDownload
#36Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alexander Korotkov (#35)
Re: Statistics and selectivity estimation for ranges

On 01.03.2013 16:22, Alexander Korotkov wrote:

These changes were made in attached patch.

Thanks.

I've been staring at this code for a very long time now, trying to
understand how the math in calc_hist_selectivity_contained works. I
think I understand it now, but it probably needs a lot more comments and
perhaps some refactoring, so that the next reader won't need to spend
hours deciphering it.

I'll walk through an example of a calc_hist_selectivity_contained
invocation, to verify that my understanding is correct. This isn't 100%
identical to how the function works; I explain it as if it holds some
temporary bins in memory and modifies them in steps, but in reality, it
keeps the bin distances and some other information only for the
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 55)", and the lower
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
looks like this:

Boundary 10 20 40 100 120
-+------+------+------+------+-
Fraction 0.25 0.25 0.25 0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
proportion, 25%, of all the tuples in the table. The function first
finds the bins containing the lower and upper bounds, 15 and 55. All the
bins outside those bounds are ignored, as there are no matching tuples
there. The fractions and the bounds of first and last bin, ie. those
containing the lower and upper bounds, are adjusted according to the
boundary values, using linear interpolation. The lower bound, 15, falls
in the middle of the bin 10-20, and the upper bound, 55, splits the
40-100 bin at a ratio of 1/4. The adjusted bins look like this:

Boundary 15 20 40 55
-+------+------+------+
Fraction 0.125 0.25 0.0625

Next, we need to calculate what proportion of tuples in each bin has a
small enough length to be contained within the query range. For that,
the distance of each bin boundary to the upper bound is calculated:

Distance 40 35 15 0
-+------+------+------+
Fraction 0.125 0.25 0.0625

The bins are walked starting from the highest bin, ie. starting from
distance 0, walking up in increasing order of distance. For each bin,
the proportion of tuples within that range that have a suitable length
is calculated. The calc_length_hist_frac function does that. That
calculation is more complicated than it sounds: for example, for the
middle bin above, calc_length_hist_frac is passed both distance
boundaries, 15 and 35. calc_length_hist_frac calculates the average of
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
of tuples with length <= x.
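
To make the clipping step concrete, here is a minimal standalone C sketch of it,
assuming float8 bounds and the equi-depth lower-bound histogram from the example
above; it only illustrates the arithmetic and is not the patch's code:

#include <stdio.h>

int
main(void)
{
	double	hist_lower[] = {10, 20, 40, 100, 120};
	int		nhist = 5;
	double	q_lower = 15, q_upper = 55;		/* query range [15, 55) */
	double	bin_frac = 1.0 / (nhist - 1);	/* equi-depth: 0.25 per bin */
	int		i;

	for (i = 0; i < nhist - 1; i++)
	{
		double	lo = hist_lower[i];
		double	hi = hist_lower[i + 1];
		double	frac = bin_frac;
		double	clipped_lo = lo;
		double	clipped_hi = hi;

		/* Bins entirely outside [q_lower, q_upper) cannot contain matches. */
		if (hi <= q_lower || lo >= q_upper)
			continue;

		/* Clip the bin at the query bounds by linear interpolation. */
		if (lo < q_lower)
		{
			frac -= bin_frac * (q_lower - lo) / (hi - lo);
			clipped_lo = q_lower;
		}
		if (hi > q_upper)
		{
			frac -= bin_frac * (hi - q_upper) / (hi - lo);
			clipped_hi = q_upper;
		}

		printf("bin [%g, %g): fraction %.4f, distance to query upper: %g .. %g\n",
			   clipped_lo, clipped_hi, frac,
			   q_upper - clipped_hi, q_upper - clipped_lo);
	}
	return 0;
}

This reproduces the adjusted bins and the per-boundary distances shown above;
the remaining step is to weight each clipped bin by the fraction of range
lengths not exceeding those distances.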

Now, here's a question, on something I didn't quite understand:

* Returns average fraction of histogram which is greater than given length.
* While this length is increasing from length1 to *length2. If histogram
* ends up before *length2 then set covered fraction of (length1, *length2)
* interval to *fraction and set end of histogram to *length2.
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double length1, double *length2, double *fraction)
{

Why the behavior explained in the last sentence in the above comment? It
seems that the abstraction provided by calc_length_hist_frac() is leaky;
the caller shouldn't need to know anything about the boundaries of the
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be
explained like this:

/*
* Let P(x) be the fraction of tuples with length <= x.
*
* calc_length_hist_frac calculates the average of P(x), in the interval [A, B].
*
* This can be calculated by the formula:
*
*          B
*    1     /
* -------  | P(x)dx
*  B - A   /
*          A
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double A, double B)

Am I correct this far? The function doesn't use the above formula as is,
but it could..
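
For what it's worth, here is a standalone C sketch of that formula, assuming an
equi-depth length histogram and a piecewise-linear P(x); the names and numbers
are illustrative and this is not the patch's calc_length_hist_frac():

#include <stdio.h>

/* P(x): fraction of lengths <= x, interpolated linearly inside each bin. */
static double
cdf(const double *hist, int n, double x)
{
	int		i;

	if (x <= hist[0])
		return 0.0;
	if (x >= hist[n - 1])
		return 1.0;
	for (i = 0; i < n - 1; i++)
	{
		if (x < hist[i + 1])
			return (i + (x - hist[i]) / (hist[i + 1] - hist[i])) / (n - 1);
	}
	return 1.0;
}

/* Average of P(x) on [A, B]: sum trapezoid areas between histogram boundaries. */
static double
avg_cdf(const double *hist, int n, double A, double B)
{
	double	area = 0.0;
	double	x1 = A;
	double	p1 = cdf(hist, n, A);
	int		i;

	if (B <= A)
		return p1;				/* degenerate interval */

	for (i = 0; i <= n; i++)
	{
		/* next breakpoint: a histogram boundary inside (A, B), or B itself */
		double	x2 = (i < n && hist[i] < B) ? hist[i] : B;

		if (x2 > x1)
		{
			area += 0.5 * (p1 + cdf(hist, n, x2)) * (x2 - x1);
			p1 = cdf(hist, n, x2);
			x1 = x2;
		}
		if (x2 >= B)
			break;
	}
	return area / (B - A);
}

int
main(void)
{
	double	len_hist[] = {1, 3, 7, 15, 40};	/* illustrative length histogram */

	printf("average of P(x) on [15, 35] = %.4f\n",
		   avg_cdf(len_hist, 5, 15.0, 35.0));
	return 0;
}

The [15, 35] call corresponds to the pair of distances passed for the middle
bin in the walkthrough above.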

I'll continue trying to understand this and add comments..

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#36)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On 01.03.2013 16:22, Alexander Korotkov wrote:

I've been staring at this code for a very long time now, trying to
understand how the math in calc_hist_selectivity_contained works. I
think I understand it now, but it probably needs a lot more comments and
perhaps some refactoring, so that the next reader won't need to spend
hours deciphering it.

I'll walk through an example of a calc_hist_selectivity_contained
invocation, to verify that my understanding is correct. This isn't 100%
identical to how the function works; I explain it as if it holds some
temporary bins in memory and modifies them in steps, but in reality, it
keeps the bin distances and some other information only for the
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 55)", and the lower
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
looks like this:

Boundary 10 20 40 100 120
-+------+------+------+------+-
Fraction 0.25 0.25 0.25 0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
proportion, 25%, of all the tuples in the table. The function first
finds the bins containing the lower and upper bounds, 15 and 55. All the
bins outside those bounds are ignored, as there are no matching tuples
there. The fractions and the bounds of first and last bin, ie. those
containing the lower and upper bounds, are adjusted according to the
boundary values, using linear interpolation. The lower bound, 15, falls
in the middle of the bin 10-20, and the upper bound, 55, splits the
40-100 bin at a ratio of 1/4. The adjusted bins look like this:

Boundary 15 20 40 55
-+------+------+------+
Fraction 0.125 0.25 0.0625

Next, we need to calculate what proportion of tuples in each bin has a
small enough length to be contained within the query range. For that,
the distance of each bin boundary to the upper bound is calculated:

Distance 40 35 15 0
-+------+------+------+
Fraction 0.125 0.25 0.0625

The bins are walked starting from the highest bin, ie. starting from
distance 0, walking up in increasing order of distance. For each bin,
the proportion of tuples within that range that have a suitable length
is calculated. The calc_length_hist_frac function does that. That
calculation is more complicated than it sounds: for example, for the
middle bin above, calc_length_hist_frac is passed both distance
boundaries, 15 and 35. calc_length_hist_frac calculates the average of
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
of tuples with length <= x.

Now, here's a question, on something I didn't quite understand:

* Returns average fraction of histogram which is greater than given
length.
* While this length is increasing from length1 to *length2. If histogram
* ends up before *length2 then set covered fraction of (length1,
*length2)
* interval to *fraction and set end of histogram to *length2.
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double length1, double *length2, double *fraction)
{

Why the behavior explained in the last sentence in the above comment? It
seems that the abstraction provided by calc_length_hist_frac() is leaky;
the caller shouldn't need to know anything about the boundaries of the
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be
explained like this:

/*
* Let P(x) be the fraction of tuples with length <= x.
*
* calc_length_hist_frac calculates the average of P(x), in the
interval [A, B].
*
* This can be calculated by the formula:
*
*          B
*    1     /
* -------  | P(x)dx
*  B - A   /
*          A
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double A, double B)

Am I correct this far? The function doesn't use the above formula as is,
but it could..

I'll continue trying to understand this and add comments..

So, after some hacking, I ended up with this version. I find it more
readable, I hope I didn't miss anything. This seems to produce results
that are close, but not identical, to the original patch. I'm not sure
where the discrepancy is coming from, or which patch is more correct in
that respect. I'll continue from this tomorrow, but if you have the
time, please take a look and let me know what you think.

- Heikki

Attachments:

rangestats-0.12-heikki.patchtext/x-diff; name=rangestats-0.12-heikki.patchDownload
*** a/src/backend/utils/adt/rangetypes_selfuncs.c
--- b/src/backend/utils/adt/rangetypes_selfuncs.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/htup_details.h"
  #include "catalog/pg_operator.h"
  #include "catalog/pg_statistic.h"
+ #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/rangetypes.h"
  #include "utils/selfuncs.h"
***************
*** 39,44 **** static int rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value,
--- 40,58 ----
  			   RangeBound *hist, int hist_length, bool equal);
  static float8 get_position(TypeCacheEntry *typcache, RangeBound *value,
  			 RangeBound *hist1, RangeBound *hist2);
+ static float8 get_len_position(double value, double hist1, double hist2);
+ static float8 get_distance(TypeCacheEntry *typcache, RangeBound *bound1,
+ 															RangeBound *bound2);
+ static int length_hist_bsearch(Datum *length_hist_values,
+ 										int length_hist_nvalues, double value);
+ static double calc_length_hist_frac(Datum *length_hist_values,
+ 	int length_hist_nvalues, double length1, double length2);
+ static double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 				RangeBound *lower, RangeBound *upper, RangeBound *hist_lower,
+ 		int hist_nvalues, Datum *length_hist_values, int length_hist_nvalues);
+ static double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 				RangeBound *lower, RangeBound *upper, RangeBound *hist_lower,
+ 		int hist_nvalues, Datum *length_hist_values, int length_hist_nvalues);
  
  /*
   * Returns a default selectivity estimate for given operator, when we don't
***************
*** 213,219 **** calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata,
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_RANGE_EMPTY_FRAC, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
--- 227,233 ----
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_LENGTH_HISTOGRAM, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
***************
*** 332,337 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 346,353 ----
  {
  	Datum	   *hist_values;
  	int			nhist;
+ 	Datum	   *length_hist_values;
+ 	int			length_nhist;
  	RangeBound *hist_lower;
  	RangeBound *hist_upper;
  	int			i;
***************
*** 365,370 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 381,400 ----
  			elog(ERROR, "bounds histogram contains an empty range");
  	}
  
+ 	/* @> and <@ also need a histogram of range lengths */
+ 	if (operator == OID_RANGE_CONTAINS_OP ||
+ 		operator == OID_RANGE_CONTAINED_OP)
+ 	{
+ 		if (!(HeapTupleIsValid(vardata->statsTuple) &&
+ 			  get_attstatsslot(vardata->statsTuple,
+ 							   vardata->atttype, vardata->atttypmod,
+ 							   STATISTIC_KIND_LENGTH_HISTOGRAM, InvalidOid,
+ 							   NULL,
+ 							   &length_hist_values, &length_nhist,
+ 							   NULL, NULL)))
+ 			return -1.0;
+ 	}
+ 
  	/* Extract the bounds of the constant value. */
  	range_deserialize(typcache, constval, &const_lower, &const_upper, &empty);
  	Assert (!empty);
***************
*** 440,445 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 470,478 ----
  			/*
  			 * A && B <=> NOT (A << B OR A >> B).
  			 *
+ 			 * Since A << B and A >> B are mutually exclusive events we can sum
+ 			 * their probabilities to find probability of (A << B OR A >> B).
+ 			 *
  			 * "range @> elem" is equivalent to "range && [elem,elem]". The
  			 * caller already constructed the singular range from the element
  			 * constant, so just treat it the same as &&.
***************
*** 454,462 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
  		case OID_RANGE_CONTAINED_OP:
! 			/* TODO: not implemented yet */
! 			hist_selec = -1.0;
  			break;
  
  		default:
--- 487,522 ----
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
+ 			hist_selec =
+ 				calc_hist_selectivity_contains(typcache, &const_lower,
+ 											&const_upper, hist_lower, nhist,
+ 											length_hist_values, length_nhist);
+ 			break;
+ 
  		case OID_RANGE_CONTAINED_OP:
! 			if (const_lower.infinite)
! 			{
! 				/*
! 				 * Lower bound no longer matters. Just estimate the fraction
! 				 * with an upper bound <= const upper bound
! 				 */
! 				hist_selec =
! 					calc_hist_selectivity_scalar(typcache, &const_upper,
! 												 hist_upper, nhist, true);
! 			}
! 			else if (const_upper.infinite)
! 			{
! 				hist_selec =
! 					1.0 - calc_hist_selectivity_scalar(typcache, &const_lower,
! 													   hist_lower, nhist, false);
! 			}
! 			else
! 			{
! 				hist_selec =
! 					calc_hist_selectivity_contained(typcache, &const_lower,
! 													&const_upper, hist_lower, nhist,
! 													length_hist_values, length_nhist);
! 			}
  			break;
  
  		default:
***************
*** 497,504 **** calc_hist_selectivity_scalar(TypeCacheEntry *typcache, RangeBound *constbound,
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less than given range bound. If all range bounds in
!  * array are greater or equal than given range bound, return -1.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 557,569 ----
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less(less or equal) than given range bound. If all
!  * range bounds in array are greater or equal(greater) than given range bound,
!  * return -1. When "equal" flag is set conditions in brackets are used.
!  *
!  * This function is used in scalar operators' selectivity estimation. Another
!  * goal of this function is to find the histogram bin at which to stop
!  * interpolating the portion of bounds less than or equal to the given bound.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
***************
*** 522,527 **** rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 587,617 ----
  	return lower;
  }
  
+ 
+ /*
+  * Binary search on length histogram. Returns greatest index of range length in
+  * histogram which is less than the given length value.
+  */
+ static int
+ length_hist_bsearch(Datum *length_hist_values, int length_hist_nvalues,
+ 					double value)
+ {
+ 	int			lower = -1,
+ 				upper = length_hist_nvalues - 1,
+ 				middle;
+ 
+ 	while (lower < upper)
+ 	{
+ 		middle = (lower + upper + 1) / 2;
+ 
+ 		if (DatumGetFloat8(length_hist_values[middle]) < value)
+ 			lower = middle;
+ 		else
+ 			upper = middle - 1;
+ 	}
+ 	return lower;
+ }
+ 
  /*
   * Get relative position of value in histogram bin in [0,1] range.
   */
***************
*** 598,600 **** get_position(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist1,
--- 688,1082 ----
  	}
  }
  
+ 
+ /*
+  * Get relative position of value in histogram bin in [0,1] range.
+  */
+ static double
+ get_len_position(double value, double hist1, double hist2)
+ {
+ 	if (!is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Both bounds are finite. Assuming the subtype's comparison function
+ 		 * works sanely, the value must be finite, too, because it lies
+ 		 * somewhere between the bounds. If it doesn't, just return something.
+ 		 */
+ 		if (is_infinite(value))
+ 			return 0.5;
+ 
+ 		/* Calculate relative position using subdiff function. */
+ 		{
+ 			double d = 1.0 - (hist2 - value) / (hist2 - hist1);
+ 			Assert(!is_infinite(d));
+ 			return d;
+ 		}
+ 	}
+ 	else if (is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Lower bin boundary is -infinite, upper is finite.
+ 		 * Return 1.0 to indicate the value is infinitely far from the lower
+ 		 * bound.
+ 		 */
+ 		return 1.0;
+ 	}
+ 	else if (!is_infinite(hist1) && is_infinite(hist2))
+ 	{
+ 		/* same as above, but in reverse */
+ 		return 0.0;
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * If both bin boundaries are infinite, they should be equal to each
+ 		 * other, and the value should also be infinite and equal to both
+ 		 * bounds. (But don't Assert that, to avoid crashing if a user creates
+ 		 * a datatype with a broken comparison function).
+ 		 *
+ 		 * Assume the value to lie in the middle of the infinite bounds.
+ 		 */
+ 		return 0.5;
+ 	}
+ }
+ 
+ /*
+  * Measure distance between two range bounds.
+  */
+ static float8
+ get_distance(TypeCacheEntry *typcache, RangeBound *bound1, RangeBound *bound2)
+ {
+ 	bool	has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
+ 
+ 	if (!bound1->infinite && !bound2->infinite)
+ 	{
+ 		/*
+ 		 * No bounds are infinite, use subdiff function or return default
+ 		 * value of 1.0 if no subdiff is available.
+ 		 */
+ 		if (has_subdiff)
+ 			return
+ 				DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
+ 												 typcache->rng_collation,
+ 												 bound2->val,
+ 												 bound1->val));
+ 		else
+ 			return 1.0;
+ 	}
+ 	else if (bound1->infinite && bound2->infinite)
+ 	{
+ 		/* Both bounds are infinite */
+ 		if (bound1->lower == bound2->lower)
+ 			return 0.0;
+ 		else
+ 			return get_float8_infinity();
+ 	}
+ 	else
+ 	{
+ 		/* One bound is infinite, another is not */
+ 		return get_float8_infinity();
+ 	}
+ }
+ 
+ /*
+  * Calculate the average of function P(x), in the interval [length1, length2],
+  * where P(x) is the fraction of tuples with length <= x.
+  */
+ static double
+ calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
+ 					  double length1, double length2)
+ {
+ 	double		frac;
+ 	double		A, B, PA, PB;
+ 	double		pos;
+ 	int			i;
+ 	double		area;
+ 
+ 	Assert(length2 >= length1);
+ 
+ 	/*----------
+ 	 * The average of a function between A and B can be calculated by the
+ 	 * formula:
+ 	 *
+ 	 *          B
+ 	 *    1     /
+ 	 * -------  | P(x)dx
+ 	 *  B - A   /
+ 	 *          A
+ 	 *
+ 	 * The geometrical interpretation is that the integral is the area under
+ 	 * the graph of P(x). P(x) is defined by the length histogram. We
+ 	 * calculate the area in a piecewise fashion, iterating through the length
+ 	 * histogram bins. Each bin is a trapezoid:
+ 	 *
+ 	 *       P(x2)
+ 	 *        /|
+ 	 *       / |
+ 	 * P(x1)/  |
+ 	 *     |   |
+ 	 *     |   |
+ 	 *  ---+---+--
+ 	 *     x1  x2
+ 	 *
+ 	 * where x1 and x2 are the boundaries of the current histogram bin, and
+ 	 * P(x1) and P(x2) are the cumulative fractions of tuples at the boundaries.
+ 	 *
+ 	 * The area of each trapezoid is 1/2 * (P(x2) + P(x1)) * (x2 - x1)
+ 	 *
+ 	 * The first bin contains the lower bound passed by the caller, so we
+ 	 * use linear interpolation between the previous and next histogram bin
+ 	 * boundary to calculate P(x1). Likewise for the last bin: we use linear
+ 	 * interpolation to calculate P(x2). For the bins in between, x1 and x2
+ 	 * lie on histogram bin boundaries, so P(x1) and P(x2) are simply:
+ 	 * P(x1) =    (bin index) / (number of bins)
+ 	 * P(x2) = (bin index + 1) / (number of bins)
+ 	 */
+ 
+ 	/* First bin, the one that contains lower bound */
+ 	i = length_hist_bsearch(length_hist_values, length_hist_nvalues, length1);
+ 	if (i >= length_hist_nvalues - 1)
+ 		return 1.0;
+ 
+ 	if (i < 0)
+ 	{
+ 		i = 0;
+ 		pos = 0.0;
+ 	}
+ 	else
+ 	{
+ 		/* interpolate length1's position in the bin */
+ 		pos = get_len_position(length1,
+ 							   DatumGetFloat8(length_hist_values[i]),
+ 							   DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 	B = length1;
+ 
+ 	/*
+ 	 * In the degenerate case that length1 == length2, simply return P(length1).
+ 	 * This is not merely an optimization: if length1 == length2, we'd divide
+ 	 * by zero later on.
+ 	 */
+ 	if (length2 == length1)
+ 		return PB;
+ 
+ 	/*
+ 	 * Loop through all the bins, until we hit the last bin, the one that
+ 	 * contains the upper bound. (if lower and upper bounds are in the same
+ 	 * bin, this falls out immediately)
+ 	 */
+ 	area = 0.0;
+ 	for (; i < length_hist_nvalues - 1 && DatumGetFloat8(length_hist_values[i + 1]) < length2; i++)
+ 	{
+ 		/* the upper bound of previous bin is the lower bound of this bin */
+ 		A = B; PA = PB;
+ 
+ 		B = DatumGetFloat8(length_hist_values[i + 1]);
+ 		PB = (double) (i + 1) / (double) (length_hist_nvalues - 1);
+ 
+ 		area += 0.5 * (PB + PA) * (B - A);
+ 	}
+ 
+ 	/* Last bin */
+ 	A = B; PA = PB;
+ 
+ 	B = length2; /* last bin ends at the query upper bound */
+ 	if (i >= length_hist_nvalues - 1)
+ 		pos = 0.0;
+ 	else
+ 	{
+ 		if (DatumGetFloat8(length_hist_values[i]) == DatumGetFloat8(length_hist_values[i + 1]))
+ 			pos = 0.0;
+ 		else
+ 			pos = get_len_position(length2, DatumGetFloat8(length_hist_values[i]), DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 
+ 	area += 0.5 * (PB + PA) * (B - A);
+ 
+ 	/*
+ 	 * Ok, we have calculated the area, ie. the integral. Divide by width to
+ 	 * get the requested average.
+ 	 */
+ 	frac = area / (length2 - length1);
+ 
+ 	return frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "<@" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var <@ const", ie. estimate the fraction of ranges that
+  * fall within the constant lower and upper bounds.
+  *
+  * The caller has already checked that lower and upper are finite
+  */
+ static double
+ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 								RangeBound *lower, RangeBound *upper,
+ 								RangeBound *hist_lower,	int hist_nvalues,
+ 								Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				upper_index;
+ 	float8		prev_dist;
+ 	double		bin_width;
+ 	double		upper_frac;
+ 	double		sum_frac;
+ 
+ 	/*
+ 	 * Begin by finding the bin containing the upper bound, in the lower bound
+ 	 * histogram. Any range with a lower bound > constant upper bound can't
+ 	 * match, ie. there are no matches in bins greater than upper_index.
+ 	 */
+ 	upper->inclusive = !upper->inclusive;
+ 	upper->lower = true;
+ 	upper_index = rbound_bsearch(typcache, upper, hist_lower, hist_nvalues, false);
+ 
+ 	/*
+ 	 * Calculate upper_frac -- the fraction of the (upper_index, upper_index + 1)
+ 	 * bin which is less than the upper bound of the query range, estimated by
+ 	 * linear interpolation of the subdiff function.
+ 	 */
+ 	if (upper_index >= 0 && upper_index < hist_nvalues - 1)
+ 		upper_frac = get_position(typcache, upper, &hist_lower[upper_index],
+ 								  &hist_lower[upper_index + 1]);
+ 	else
+ 		upper_frac = 0.0;
+ 
+ 	/*
+ 	 * bin_width represents the width of the current bin. Normally it is 1.0,
+ 	 * meaning a full width bin, but can be less in the corner cases: start
+ 	 * and end of the loop. We start with bin_width = upper_frac, because we
+ 	 * begin at the bin containing the upper bound.
+ 	 *
+ 	 * In the loop, dist and prev_dist are the distances of the "current" bin's
+ 	 * lower and upper bounds from the constant upper bound.
+ 	 */
+ 	bin_width = upper_frac;
+ 	prev_dist = 0.0;
+ 
+ 	sum_frac = 0.0;
+ 	for (i = upper_index; i >= 0; i--)
+ 	{
+ 		double		dist;
+ 		double		length_hist_frac;
+ 		bool		final_bin = false;
+ 
+ 		/*
+ 		 * dist -- distance from upper bound of query range to current
+ 		 * value of lower bound histogram or lower bound of query range (if
+ 		 * we've reached it).
+ 		 */
+ 		if (range_cmp_bounds(typcache, &hist_lower[i], lower) < 0)
+ 		{
+ 			dist = get_distance(typcache, lower, upper);
+ 			/*
+ 			 * Adjust bin width, as we are only taking into account a part of
+ 			 * this bin.
+ 			 */
+ 			bin_width *= get_position(typcache, lower, &hist_lower[i],
+ 									  &hist_lower[i + 1]);
+ 			final_bin = true;
+ 		}
+ 		else
+ 			dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Get the average fraction of the length histogram with lengths no
+ 		 * longer than the distance to the upper bound of the query range.
+ 		 */
+ 		length_hist_frac = calc_length_hist_frac(length_hist_values,
+ 			length_hist_nvalues, prev_dist, dist);
+ 
+ 		/*
+ 		 * Add this bin's share of the lower bound histogram, weighted by the
+ 		 * average fraction of lengths (between the previous and the current
+ 		 * distance) that are short enough to satisfy the query.
+ 		 */
+ 		sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
+ 
+ 		if (final_bin)
+ 			break;
+ 
+ 		bin_width = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "@>" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var @> const", ie. estimate the fraction of ranges that
+  * contain the constant lower and upper bounds.
+  */
+ static double
+ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 							   RangeBound *lower, RangeBound *upper,
+ 							   RangeBound *hist_lower, int hist_nvalues,
+ 							   Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				lower_index;
+ 	float8		hist_frac,
+ 				lower_frac;
+ 	double		sum_frac;
+ 	float8		prev_dist;
+ 
+ 	/* Find index corresponding to lower bound of query range. */
+ 	lower_index = rbound_bsearch(typcache, lower, hist_lower, hist_nvalues, false);
+ 
+ 	/*
+ 	 * Calculate lower_frac -- the fraction of the (lower_index, lower_index + 1)
+ 	 * bin which is less than the lower bound of the query range, estimated by
+ 	 * linear interpolation of the subdiff function.
+ 	 */
+ 	if (lower_index >= 0 && lower_index < hist_nvalues - 1)
+ 		lower_frac = get_position(typcache, lower, &hist_lower[lower_index],
+ 								  &hist_lower[lower_index + 1]);
+ 	else
+ 		lower_frac = 0.0;
+ 
+ 	/*
+ 	 * We start from a selectivity of 0 and iterate over the lower bounds
+ 	 * histogram, from the bin containing the lower bound of the query range
+ 	 * down to the beginning of the histogram. At each step we add the
+ 	 * fraction of values in the bin whose length is large enough to contain
+ 	 * the query range. We use linear interpolation for a more accurate result.
+ 	 */
+ 	prev_dist = get_distance(typcache, lower, upper);
+ 	sum_frac = 0.0;
+ 	hist_frac = lower_frac;
+ 	for (i = lower_index; i >= 0; i--)
+ 	{
+ 		float8		dist;
+ 		double		length_hist_frac;
+ 
+ 		/*
+ 		 * dist -- distance from upper bound of query range to current
+ 		 * value of lower bound histogram or lower bound of query range (if
+ 		 * we've reached it).
+ 		 */
+ 		dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Get average fraction of length histogram which covers intervals
+ 		 * longer than distance to upper bound of query range.
+ 		 */
+ 		length_hist_frac =
+ 			1.0 - calc_length_hist_frac(length_hist_values,
+ 										length_hist_nvalues,
+ 										prev_dist, dist);
+ 
+ 		sum_frac += length_hist_frac * hist_frac / (double) (hist_nvalues - 1);
+ 
+ 		hist_frac = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
***************
*** 29,34 ****
--- 29,36 ----
  #include "utils/builtins.h"
  #include "utils/rangetypes.h"
  
+ static int float8_qsort_cmp(const void *a1, const void *a2);
+ static int range_bound_qsort_cmp(const void *a1, const void *a2, void *arg);
  static void compute_range_stats(VacAttrStats *stats,
  		   AnalyzeAttrFetchFunc fetchfunc, int samplerows, double totalrows);
  
***************
*** 57,62 **** range_typanalyze(PG_FUNCTION_ARGS)
--- 59,84 ----
  }
  
  /*
+  * Comparison function for float8 which are used for measurement of range
+  * size.
+  */
+ static int
+ float8_qsort_cmp(const void *a1, const void *a2)
+ {
+ 	const float8 *f1,
+ 			   *f2;
+ 
+ 	f1 = (const float8 *) a1;
+ 	f2 = (const float8 *) a2;
+ 	if (*f1 < *f2)
+ 		return -1;
+ 	else if (*f1 == *f2)
+ 		return 0;
+ 	else
+ 		return 1;
+ }
+ 
+ /*
   * Comparison function for sorting RangeBounds.
   */
  static int
***************
*** 77,82 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 99,105 ----
  					int samplerows, double totalrows)
  {
  	TypeCacheEntry *typcache = (TypeCacheEntry *) stats->extra_data;
+ 	bool		has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
  	int			null_cnt = 0;
  	int			non_null_cnt = 0;
  	int			non_empty_cnt = 0;
***************
*** 85,94 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 108,119 ----
  	int			slot_idx;
  	int			num_bins = stats->attr->attstattarget;
  	int			num_hist;
+ 	float8	   *lengths;
  	RangeBound *lowers, *uppers;
  	double		total_width = 0;
  
  	/* Allocate memory for arrays of range bounds. */
+ 	lengths = (float8 *) palloc(sizeof(float8) * samplerows);
  	lowers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  	uppers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  
***************
*** 101,106 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 126,132 ----
  		RangeType  *range;
  		RangeBound	lower,
  					upper;
+ 		float8		length;
  
  		vacuum_delay_point();
  
***************
*** 122,127 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 148,179 ----
  		range = DatumGetRangeType(value);
  		range_deserialize(typcache, range, &lower, &upper, &empty);
  
+ 		if (empty)
+ 		{
+ 			/* Length of empty range is zero */
+ 			length = 0.0;
+ 		}
+ 		else if (lower.infinite || upper.infinite)
+ 		{
+ 			/* Length of any kind of infinite range is infinite */
+ 			length = get_float8_infinity();
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * For an ordinary range, use the subdiff function between the upper
+ 			 * and lower bound values, or the default value of 1.0 if no subdiff
+ 			 * is available.
+ 			 */
+ 			if (has_subdiff)
+ 				length = DatumGetFloat8(FunctionCall2Coll(
+ 												&typcache->rng_subdiff_finfo,
+ 												typcache->rng_collation,
+ 												upper.val, lower.val));
+ 			else
+ 				length = 1.0;
+ 		}
+ 
  		if (!empty)
  		{
  			/* Fill bound values for further usage in histograms */
***************
*** 132,137 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 184,190 ----
  		else
  			empty_cnt++;
  
+ 		lengths[non_null_cnt] = length;
  		non_null_cnt++;
  	}
  
***************
*** 141,146 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 194,200 ----
  	if (non_null_cnt > 0)
  	{
  		Datum	   *bound_hist_values;
+ 		Datum	   *length_hist_values;
  		int			pos,
  					posfrac,
  					delta,
***************
*** 210,219 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			slot_idx++;
  		}
  
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 		stats->stakind[slot_idx] = STATISTIC_KIND_RANGE_EMPTY_FRAC;
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
  		slot_idx++;
--- 264,331 ----
  			slot_idx++;
  		}
  
+ 		stats->stakind[slot_idx] = STATISTIC_KIND_LENGTH_HISTOGRAM;
+ 		/*
+ 		 * Generate a length histogram slot entry if we've collected at least
+ 		 * two length values (they need not be distinct).
+ 		 */
+ 		if (non_null_cnt >= 2)
+ 		{
+ 			/*
+ 			 * Ascending sort of range lengths for further filling of
+ 			 * histogram
+ 			 */
+ 			qsort(lengths, non_null_cnt, sizeof(float8), float8_qsort_cmp);
+ 
+ 			num_hist = non_null_cnt;
+ 			if (num_hist > num_bins)
+ 				num_hist = num_bins + 1;
+ 
+ 			length_hist_values = (Datum *) palloc(num_hist * sizeof(Datum));
+ 
+ 			/*
+ 			 * The object of this loop is to copy the first and last lengths[]
+ 			 * entries along with evenly-spaced values in between. So the i'th
+ 			 * value is lengths[(i * (nvals - 1)) / (num_hist - 1)]. But
+ 			 * computing that subscript directly risks integer overflow when the
+ 			 * stats target is more than a couple thousand.  Instead we add
+ 			 * (nvals - 1) / (num_hist - 1) to pos at each step, tracking the
+ 			 * integral and fractional parts of the sum separately.
+ 			 */
+ 			delta = (non_null_cnt - 1) / (num_hist - 1);
+ 			deltafrac = (non_null_cnt - 1) % (num_hist - 1);
+ 			pos = posfrac = 0;
+ 
+ 			for (i = 0; i < num_hist; i++)
+ 			{
+ 				length_hist_values[i] = Float8GetDatum(lengths[pos]);
+ 				pos += delta;
+ 				posfrac += deltafrac;
+ 				if (posfrac >= (num_hist - 1))
+ 				{
+ 					/* fractional part exceeds 1, carry to integer part */
+ 					pos++;
+ 					posfrac -= (num_hist - 1);
+ 				}
+ 			}
+ 			stats->staop[slot_idx] = Float8LessOperator;
+ 			stats->stavalues[slot_idx] = length_hist_values;
+ 			stats->numvalues[slot_idx] = num_hist;
+ 			/* We are storing float8 values */
+ 			stats->statypid[slot_idx] = FLOAT8OID;
+ 			stats->statyplen[slot_idx] = sizeof(float8);
+ #ifdef USE_FLOAT8_BYVAL
+ 			stats->statypbyval[slot_idx] = true;
+ #else
+ 			stats->statypbyval[slot_idx] = false;
+ #endif
+ 			stats->statypalign[slot_idx] = 'd';
+ 		}
+ 
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
  		slot_idx++;
*** a/src/include/catalog/pg_operator.h
--- b/src/include/catalog/pg_operator.h
***************
*** 527,532 **** DATA(insert OID = 671 (  "<>"	   PGNSP PGUID b f f	701  701	16 671 670 float8ne
--- 527,533 ----
  DESCR("not equal");
  DATA(insert OID = 672 (  "<"	   PGNSP PGUID b f f	701  701	16 674 675 float8lt scalarltsel scalarltjoinsel ));
  DESCR("less than");
+ #define Float8LessOperator	672
  DATA(insert OID = 673 (  "<="	   PGNSP PGUID b f f	701  701	16 675 674 float8le scalarltsel scalarltjoinsel ));
  DESCR("less than or equal");
  DATA(insert OID = 674 (  ">"	   PGNSP PGUID b f f	701  701	16 672 673 float8gt scalargtsel scalargtjoinsel ));
*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h
***************
*** 269,279 **** typedef FormData_pg_statistic *Form_pg_statistic;
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * An "empty frac" slot describes the fraction of empty ranges in a range-type
!  * column.  stavalues is not used and should be NULL.  stanumbers contains a
!  * single entry, the fraction of empty ranges (0.0 to 1.0).
   */
! #define STATISTIC_KIND_RANGE_EMPTY_FRAC  6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
--- 269,285 ----
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * A "length histogram" slot describes the distribution of range lengths in
!  * rows of a range-type column. Only non-null rows are considered. We use the
!  * subdiff function to measure the lengths of non-empty, non-infinite ranges,
!  * which is why the lengths are stored in stavalues as a float8[]. The M (>=2)
!  * members form a histogram that divides the population of lengths into M-1
!  * bins of approximately equal population. The first of these is the minimum
!  * observed length, and the last the maximum. stanumbers contains a single
!  * entry, the fraction of empty ranges. If the typanalyze function finds only
!  * one non-NULL range, it fills only stanumbers, while stavalues is left NULL.
   */
! #define STATISTIC_KIND_LENGTH_HISTOGRAM	6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#37)
Re: Statistics and selectivity estimation for ranges

Hi!

Thanks for your work on this patch!

On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

So, after some hacking, I ended up with this version. I find it more
readable, I hope I didn't miss anything. This seems to produce results that
are close, but not identical, to the original patch. I'm not sure where the
discrepancy is coming from, or which patch is more correct in that respect.
I'll continue from this tomorrow, but if you have the time, please take a
look and let me know what you think.

I've read your explanation and your version of the patch. In general it seems
correct to me.
One reason I broke up the abstraction in some functions is
infinities. For example, in calc_length_hist_frac one or both of length1
and length2 can be infinity. In the line

frac = area / (length2 - length1);

you can get a NaN result. I've adjusted the code specifically to get a more or
less correct result in this case.

Another minor note about this line

bin_width *= get_position(typcache, lower, &hist_lower[i],
&hist_lower[i + 1]);

ISTM it should look like

bin_width -= 1.0 - get_position(typcache, lower, &hist_lower[i],
&hist_lower[i + 1]);

Imagine the lower and upper bounds fall into the same histogram bin. In this
case we should subtract the lengths of both parts that were cut off, on the
left and on the right.

------
With best regards,
Alexander Korotkov.

#39Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alexander Korotkov (#38)
1 attachment(s)
Re: Statistics and selectivity estimation for ranges

On 01.03.2013 16:22, Alexander Korotkov wrote:

On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

So, after some hacking, I ended up with this version. I find it more
readable, I hope I didn't miss anything. This seems to produce results that
are close, but not identical, to the original patch. I'm not sure where the
discrepancy is coming from, or which patch is more correct in that respect.
I'll continue from this tomorrow, but if you have the time, please take a
look and let me know what you think.

I've read your explanation and version of patch. In general it seems
correct to me.
There is one point why I have breaked up abstraction in some functions is
infinities. For example, in calc_length_hist_frac one or both of length1
and length2 can be infinity. In the line

frac = area / (length2 - length1);

you can get NaN result. I've especially adjusted the code to get more of
less correct result in this case.

Hmm, good point. I think I managed to fix those cases in the attached
version. Is there any other corner case that I missed?

Another minor note about this line

bin_width *= get_position(typcache, lower,&hist_lower[i],
&hist_lower[i + 1]);

ITSM it sould looks like

bin_width -= 1.0 - get_position(typcache, lower,&hist_lower[i],
&hist_lower[i + 1]);

Imagine lower and upper bounds fall into same histogram bin. In this case
we should subtract lengths of both parts which were cut in the left and in
the right.

Yes, true. There's one negation too many above, though; should be:

bin_width -= get_position(typcache, lower,&hist_lower[i],
&hist_lower[i + 1]);

Fixed that. Barring any more issues, I'll read through this once more
tomorrow and commit.
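
To see the difference numerically, here is a tiny standalone C illustration of
that correction, assuming a single lower-bound bin [40, 100) that contains both
query bounds (50 and 70); the variable names are illustrative and this is only
the arithmetic, not the patch's code:

#include <stdio.h>

int
main(void)
{
	double	bin_lo = 40, bin_hi = 100;
	double	q_lower = 50, q_upper = 70;

	/* fraction of the bin lying below each query bound */
	double	pos_lower = (q_lower - bin_lo) / (bin_hi - bin_lo);	/* ~0.167 */
	double	pos_upper = (q_upper - bin_lo) / (bin_hi - bin_lo);	/* 0.5 */

	/* bin_width starts as the part of the bin below the query upper bound */
	double	bin_width = pos_upper;

	/* multiplying discards the wrong amount when both bounds share a bin */
	double	multiplied = bin_width * pos_lower;		/* ~0.083 */

	/* subtracting the part below the query lower bound gives the right slice */
	double	subtracted = bin_width - pos_lower;		/* ~0.333 */

	printf("multiplied = %.3f, subtracted = %.3f, expected (70-50)/(100-40) = %.3f\n",
		   multiplied, subtracted, (q_upper - q_lower) / (bin_hi - bin_lo));
	return 0;
}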

- Heikki

Attachments:

rangestats-0.13-heikki.patchtext/x-diff; name=rangestats-0.13-heikki.patchDownload
*** a/src/backend/utils/adt/rangetypes_selfuncs.c
--- b/src/backend/utils/adt/rangetypes_selfuncs.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/htup_details.h"
  #include "catalog/pg_operator.h"
  #include "catalog/pg_statistic.h"
+ #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/rangetypes.h"
  #include "utils/selfuncs.h"
***************
*** 39,44 **** static int rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value,
--- 40,60 ----
  			   RangeBound *hist, int hist_length, bool equal);
  static float8 get_position(TypeCacheEntry *typcache, RangeBound *value,
  			 RangeBound *hist1, RangeBound *hist2);
+ static float8 get_len_position(double value, double hist1, double hist2);
+ static float8 get_distance(TypeCacheEntry *typcache, RangeBound *bound1,
+ 															RangeBound *bound2);
+ static int length_hist_bsearch(Datum *length_hist_values,
+ 					int length_hist_nvalues, double value, bool equal);
+ static double calc_length_hist_frac(Datum *length_hist_values,
+ 									int length_hist_nvalues, double length1, double length2, bool equal);
+ static double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 								RangeBound *lower, RangeBound *upper,
+ 								RangeBound *hist_lower, int hist_nvalues,
+ 							Datum *length_hist_values, int length_hist_nvalues);
+ static double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 							   RangeBound *lower, RangeBound *upper,
+ 							   RangeBound *hist_lower, int hist_nvalues,
+ 							Datum *length_hist_values, int length_hist_nvalues);
  
  /*
   * Returns a default selectivity estimate for given operator, when we don't
***************
*** 213,219 **** calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata,
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_RANGE_EMPTY_FRAC, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
--- 229,235 ----
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
***************
*** 332,337 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 348,355 ----
  {
  	Datum	   *hist_values;
  	int			nhist;
+ 	Datum	   *length_hist_values;
+ 	int			length_nhist;
  	RangeBound *hist_lower;
  	RangeBound *hist_upper;
  	int			i;
***************
*** 365,370 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 383,403 ----
  			elog(ERROR, "bounds histogram contains an empty range");
  	}
  
+ 	/* @> and <@ also need a histogram of range lengths */
+ 	if (operator == OID_RANGE_CONTAINS_OP ||
+ 		operator == OID_RANGE_CONTAINED_OP)
+ 	{
+ 		if (!(HeapTupleIsValid(vardata->statsTuple) &&
+ 			  get_attstatsslot(vardata->statsTuple,
+ 							   vardata->atttype, vardata->atttypmod,
+ 							   STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+ 							   InvalidOid,
+ 							   NULL,
+ 							   &length_hist_values, &length_nhist,
+ 							   NULL, NULL)))
+ 			return -1.0;
+ 	}
+ 
  	/* Extract the bounds of the constant value. */
  	range_deserialize(typcache, constval, &const_lower, &const_upper, &empty);
  	Assert (!empty);
***************
*** 440,445 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 473,481 ----
  			/*
  			 * A && B <=> NOT (A << B OR A >> B).
  			 *
+ 			 * Since A << B and A >> B are mutually exclusive events we can sum
+ 			 * their probabilities to find probability of (A << B OR A >> B).
+ 			 *
  			 * "range @> elem" is equivalent to "range && [elem,elem]". The
  			 * caller already constructed the singular range from the element
  			 * constant, so just treat it the same as &&.
***************
*** 454,462 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
  		case OID_RANGE_CONTAINED_OP:
! 			/* TODO: not implemented yet */
! 			hist_selec = -1.0;
  			break;
  
  		default:
--- 490,525 ----
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
+ 			hist_selec =
+ 				calc_hist_selectivity_contains(typcache, &const_lower,
+ 											&const_upper, hist_lower, nhist,
+ 											length_hist_values, length_nhist);
+ 			break;
+ 
  		case OID_RANGE_CONTAINED_OP:
! 			if (const_lower.infinite)
! 			{
! 				/*
! 				 * Lower bound no longer matters. Just estimate the fraction
! 				 * with an upper bound <= const upper bound
! 				 */
! 				hist_selec =
! 					calc_hist_selectivity_scalar(typcache, &const_upper,
! 												 hist_upper, nhist, true);
! 			}
! 			else if (const_upper.infinite)
! 			{
! 				hist_selec =
! 					1.0 - calc_hist_selectivity_scalar(typcache, &const_lower,
! 													   hist_lower, nhist, false);
! 			}
! 			else
! 			{
! 				hist_selec =
! 					calc_hist_selectivity_contained(typcache, &const_lower,
! 													&const_upper, hist_lower, nhist,
! 													length_hist_values, length_nhist);
! 			}
  			break;
  
  		default:
***************
*** 497,504 **** calc_hist_selectivity_scalar(TypeCacheEntry *typcache, RangeBound *constbound,
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less than given range bound. If all range bounds in
!  * array are greater or equal than given range bound, return -1.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 560,572 ----
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less than (or, when the "equal" flag is set, less
!  * than or equal to) the given range bound. If all range bounds in the array
!  * are greater than or equal to (respectively, greater than) the given range
!  * bound, return -1.
!  *
!  * This function is used in scalar operator selectivity estimation, and also
!  * to find the histogram bin at which to stop interpolating the portion of
!  * bounds that are less than or equal to the given bound.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
***************
*** 522,527 **** rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 590,625 ----
  	return lower;
  }
  
+ 
+ /*
+  * Binary search on length histogram. Returns greatest index of range length in
+  * histogram which is less than (less than or equal) the given length value. If
+  * all lengths in the histogram are greater than (greater than or equal) the
+  * given length, returns -1.
+  */
+ static int
+ length_hist_bsearch(Datum *length_hist_values, int length_hist_nvalues,
+ 					double value, bool equal)
+ {
+ 	int			lower = -1,
+ 				upper = length_hist_nvalues - 1,
+ 				middle;
+ 
+ 	while (lower < upper)
+ 	{
+ 		double middleval;
+ 
+ 		middle = (lower + upper + 1) / 2;
+ 
+ 		middleval = DatumGetFloat8(length_hist_values[middle]);
+ 		if (middleval < value || (equal && middleval <= value))
+ 			lower = middle;
+ 		else
+ 			upper = middle - 1;
+ 	}
+ 	return lower;
+ }
+ 
  /*
   * Get relative position of value in histogram bin in [0,1] range.
   */
***************
*** 598,600 **** get_position(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist1,
--- 696,1122 ----
  	}
  }
  
+ 
+ /*
+  * Get the relative position of a value within a length histogram bin, as a
+  * fraction in the [0,1] range.
+  */
+ static double
+ get_len_position(double value, double hist1, double hist2)
+ {
+ 	if (!is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Both bounds are finite. Assuming the subtype's comparison function
+ 		 * works sanely, the value must be finite, too, because it lies
+ 		 * somewhere between the bounds. If it doesn't, just return something.
+ 		 */
+ 		if (is_infinite(value))
+ 			return 0.5;
+ 
+ 		/* Calculate relative position using subdiff function. */
+ 		{
+ 			double d = 1.0 - (hist2 - value) / (hist2 - hist1);
+ 			Assert(!is_infinite(d));
+ 			return d;
+ 		}
+ 	}
+ 	else if (is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Lower bin boundary is -infinite, upper is finite.
+ 		 * Return 1.0 to indicate the value is infinitely far from the lower
+ 		 * bound.
+ 		 */
+ 		return 1.0;
+ 	}
+ 	else if (!is_infinite(hist1) && is_infinite(hist2))
+ 	{
+ 		/* same as above, but in reverse */
+ 		return 0.0;
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * If both bin boundaries are infinite, they should be equal to each
+ 		 * other, and the value should also be infinite and equal to both
+ 		 * bounds. (But don't Assert that, to avoid crashing if a user creates
+ 		 * a datatype with a broken comparison function).
+ 		 *
+ 		 * Assume the value to lie in the middle of the infinite bounds.
+ 		 */
+ 		return 0.5;
+ 	}
+ }
+ 
+ /*
+  * Measure distance between two range bounds.
+  */
+ static float8
+ get_distance(TypeCacheEntry *typcache, RangeBound *bound1, RangeBound *bound2)
+ {
+ 	bool	has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
+ 
+ 	if (!bound1->infinite && !bound2->infinite)
+ 	{
+ 		/*
+ 		 * Neither bound is infinite; use the subdiff function, or return a
+ 		 * default value of 1.0 if no subdiff function is available.
+ 		 */
+ 		if (has_subdiff)
+ 			return
+ 				DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
+ 												 typcache->rng_collation,
+ 												 bound2->val,
+ 												 bound1->val));
+ 		else
+ 			return 1.0;
+ 	}
+ 	else if (bound1->infinite && bound2->infinite)
+ 	{
+ 		/* Both bounds are infinite */
+ 		if (bound1->lower == bound2->lower)
+ 			return 0.0;
+ 		else
+ 			return get_float8_infinity();
+ 	}
+ 	else
+ 	{
+ 		/* One bound is infinite, another is not */
+ 		return get_float8_infinity();
+ 	}
+ }
+ 
+ /*
+  * Calculate the average of function P(x), in the interval [length1, length2],
+  * where P(x) is the fraction of tuples with length < x (or length <= x if
+  * 'equal' is true).
+  */
+ static double
+ calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
+ 					  double length1, double length2, bool equal)
+ {
+ 	double		frac;
+ 	double		A, B, PA, PB;
+ 	double		pos;
+ 	int			i;
+ 	double		area;
+ 
+ 	Assert(length2 >= length1);
+ 
+ 	if (length2 < 0.0)
+ 		return 0.0; /* shouldn't happen, but doesn't hurt to check */
+ 
+ 	if (is_infinite(length2) && equal)
+ 		return 1.0;
+ 
+ 	/*----------
+ 	 * The average of a function between A and B can be calculated by the
+ 	 * formula:
+ 	 *
+ 	 *          B
+ 	 *    1     /
+ 	 * -------  | P(x)dx
+ 	 *  B - A   /
+ 	 *          A
+ 	 *
+ 	 * The geometrical interpretation is that the integral is the area under
+ 	 * the graph of P(x). P(x) is defined by the length histogram. We
+ 	 * calculate the area in a piecewise fashion, iterating through the length
+ 	 * histogram bins. Each bin is a trapezoid:
+ 	 *
+ 	 *       P(x2)
+ 	 *        /|
+ 	 *       / |
+ 	 * P(x1)/  |
+ 	 *     |   |
+ 	 *     |   |
+ 	 *  ---+---+--
+ 	 *     x1  x2
+ 	 *
+ 	 * where x1 and x2 are the boundaries of the current histogram bin, and
+ 	 * P(x1) and P(x2) are the cumulative fractions of tuples at those
+ 	 * boundaries.
+ 	 *
+ 	 * The area of each trapezoid is 1/2 * (P(x2) + P(x1)) * (x2 - x1)
+ 	 *
+ 	 * The first bin contains the lower bound passed by the caller, so we
+ 	 * use linear interpolation between the previous and next histogram bin
+ 	 * boundary to calculate P(x1). Likewise for the last bin: we use linear
+ 	 * interpolation to calculate P(x2). For the bins in between, x1 and x2
+ 	 * lie on histogram bin boundaries, so P(x1) and P(x2) are simply:
+ 	 * P(x1) =    (bin index) / (number of bins)
+ 	 * P(x2) = (bin index + 1) / (number of bins)
+ 	 */
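+ 
+ 	/*
+ 	 * Worked example with made-up numbers: if a bin spans lengths x1 = 5 and
+ 	 * x2 = 25, with cumulative fractions P(x1) = 0.125 and P(x2) = 0.625, its
+ 	 * trapezoid area is 0.5 * (0.125 + 0.625) * (25 - 5) = 7.5. Averaging
+ 	 * over just that bin gives 7.5 / (25 - 5) = 0.375, the midpoint of P(x1)
+ 	 * and P(x2), as expected when P(x) is linear within the bin.
+ 	 */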
+ 
+ 	/* First bin, the one that contains lower bound */
+ 	i = length_hist_bsearch(length_hist_values, length_hist_nvalues, length1, equal);
+ 	if (i >= length_hist_nvalues - 1)
+ 		return 1.0;
+ 
+ 	if (i < 0)
+ 	{
+ 		i = 0;
+ 		pos = 0.0;
+ 	}
+ 	else
+ 	{
+ 		/* interpolate length1's position in the bin */
+ 		pos = get_len_position(length1,
+ 							   DatumGetFloat8(length_hist_values[i]),
+ 							   DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 	B = length1;
+ 
+ 	/*
+ 	 * In the degenerate case that length1 == length2, simply return P(length1).
+ 	 * This is not merely an optimization: if length1 == length2, we'd divide
+ 	 * by zero later on.
+ 	 */
+ 	if (length2 == length1)
+ 		return PB;
+ 
+ 	/*
+ 	 * Loop through all the bins, until we hit the last bin, the one that
+ 	 * contains the upper bound. (if lower and upper bounds are in the same
+ 	 * bin, this falls out immediately)
+ 	 */
+ 	area = 0.0;
+ 	for (; i < length_hist_nvalues - 1; i++)
+ 	{
+ 		double bin_upper = DatumGetFloat8(length_hist_values[i + 1]);
+ 
+ 		/* check if we've reached the last bin */
+ 		if (!(bin_upper < length2 || (equal && bin_upper <= length2)))
+ 			break;
+ 
+ 		/* the upper bound of previous bin is the lower bound of this bin */
+ 		A = B; PA = PB;
+ 
+ 		B = bin_upper;
+ 		PB = (double) i / (double) (length_hist_nvalues - 1);
+ 
+ 		/*
+ 		 * Add the area of this trapezoid to the total. The point of the
+ 		 * if-check is to avoid NaN, in the corner case that PA == PB == 0, and
+ 		 * B - A == Inf. The area of a zero-height trapezoid (PA == PB == 0) is
+ 		 * zero, regardless of the width (B - A).
+ 		 */
+ 		if (PA > 0 || PB > 0)
+ 			area += 0.5 * (PB + PA) * (B - A);
+ 	}
+ 
+ 	/* Last bin */
+ 	A = B; PA = PB;
+ 
+ 	B = length2; /* last bin ends at the query upper bound */
+ 	if (i >= length_hist_nvalues - 1)
+ 		pos = 0.0;
+ 	else
+ 	{
+ 		if (DatumGetFloat8(length_hist_values[i]) == DatumGetFloat8(length_hist_values[i + 1]))
+ 			pos = 0.0;
+ 		else
+ 			pos = get_len_position(length2, DatumGetFloat8(length_hist_values[i]), DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 
+ 	if (PA > 0 || PB > 0)
+ 		area += 0.5 * (PB + PA) * (B - A);
+ 
+ 	/*
+ 	 * Ok, we have calculated the area, ie. the integral. Divide by width to
+ 	 * get the requested average.
+ 	 */
+ 	frac = area / (length2 - length1);
+ 
+ 	return frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "<@" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var <@ const", ie. estimate the fraction of ranges that
+  * fall within the constant lower and upper bounds.
+  *
+  * The caller has already checked that the lower and upper bounds are finite.
+  */
+ static double
+ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 								RangeBound *lower, RangeBound *upper,
+ 								RangeBound *hist_lower,	int hist_nvalues,
+ 								Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				upper_index;
+ 	float8		prev_dist;
+ 	double		bin_width;
+ 	double		upper_bin_width;
+ 	double		sum_frac;
+ 
+ 	/*
+ 	 * Begin by finding the bin containing the upper bound, in the lower bound
+ 	 * histogram. Any range with a lower bound > constant upper bound can't
+ 	 * match, ie. there are no matches in bins greater than upper_index.
+ 	 */
+ 	upper->inclusive = !upper->inclusive;
+ 	upper->lower = true;
+ 	upper_index = rbound_bsearch(typcache, upper, hist_lower, hist_nvalues,
+ 								 false);
+ 
+ 	/*
+ 	 * Calculate upper_bin_width, ie. the fraction of the (upper_index,
+ 	 * upper_index + 1) bin which is greater than upper bound of query range
+ 	 * using linear interpolation of subdiff function.
+ 	 */
+ 	if (upper_index >= 0 && upper_index < hist_nvalues - 1)
+ 		upper_bin_width = get_position(typcache, upper,
+ 									   &hist_lower[upper_index],
+ 									   &hist_lower[upper_index + 1]);
+ 	else
+ 		upper_bin_width = 0.0;
+ 
+ 	/*
+ 	 * In the loop, dist and prev_dist are the distance of the "current" bin's
+ 	 * lower and upper bounds from the constant upper bound.
+ 	 *
+ 	 * bin_width represents the width of the current bin. Normally it is 1.0,
+ 	 * meaning a full width bin, but can be less in the corner cases: start
+ 	 * and end of the loop. We start with bin_width = upper_bin_width, because
+ 	 * we begin at the bin containing the upper bound.
+ 	 */
+ 	prev_dist = 0.0;
+ 	bin_width = upper_bin_width;
+ 
+ 	sum_frac = 0.0;
+ 	for (i = upper_index; i >= 0; i--)
+ 	{
+ 		double		dist;
+ 		double		length_hist_frac;
+ 		bool		final_bin = false;
+ 
+ 		/*
+ 		 * dist -- distance from upper bound of query range to lower bound of
+ 		 * the current bin in the lower bound histogram. Or to the lower bound
+ 		 * of the constant range, if this is the final bin, containing the
+ 		 * constant lower bound.
+ 		 */
+ 		if (range_cmp_bounds(typcache, &hist_lower[i], lower) < 0)
+ 		{
+ 			dist = get_distance(typcache, lower, upper);
+ 			/*
+ 			 * Subtract from bin_width the portion of this bin that we want
+ 			 * to ignore.
+ 			 */
+ 			bin_width -= get_position(typcache, lower, &hist_lower[i],
+ 									  &hist_lower[i + 1]);
+ 			if (bin_width < 0.0)
+ 				bin_width = 0.0;
+ 			final_bin = true;
+ 		}
+ 		else
+ 			dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Estimate the fraction of tuples in this bin that are narrow enough
+ 		 * to not exceed the distance to the upper bound of the query range.
+ 		 */
+ 		length_hist_frac = calc_length_hist_frac(length_hist_values,
+ 												 length_hist_nvalues,
+ 												 prev_dist, dist, true);
+ 
+ 		/*
+ 		 * Add the fraction of tuples in this bin, with a suitable length,
+ 		 * to the total.
+ 		 */
+ 		sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
+ 
+ 		if (final_bin)
+ 			break;
+ 
+ 		bin_width = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "@>" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var @> const", ie. estimate the fraction of ranges that
+  * contain the constant lower and upper bounds.
+  */
+ static double
+ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 							   RangeBound *lower, RangeBound *upper,
+ 							   RangeBound *hist_lower, int hist_nvalues,
+ 							   Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				lower_index;
+ 	double		bin_width,
+ 				lower_bin_width;
+ 	double		sum_frac;
+ 	float8		prev_dist;
+ 
+ 	/* Find the bin containing the lower bound of query range. */
+ 	lower_index = rbound_bsearch(typcache, lower, hist_lower, hist_nvalues,
+ 								 true);
+ 
+ 	/*
+ 	 * Calculate lower_bin_width, ie. the fraction of the (lower_index,
+ 	 * lower_index + 1) bin which is greater than lower bound of query range
+ 	 * using linear interpolation of subdiff function.
+ 	 */
+ 	if (lower_index >= 0 && lower_index < hist_nvalues - 1)
+ 		lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
+ 								  &hist_lower[lower_index + 1]);
+ 	else
+ 		lower_bin_width = 0.0;
+ 
+ 	/*
+ 	 * Loop through all the lower bound bins, smaller than the query lower
+ 	 * bound. In the loop, dist and prev_dist are the distance of the "current"
+ 	 * bin's lower and upper bounds from the constant upper bound. We begin
+ 	 * from query lower bound, and walk backwards, so the first bin's upper
+ 	 * bound is the query lower bound, and its distance to the query upper
+ 	 * bound is the length of the query range.
+ 	 *
+ 	 * bin_width represents the width of the current bin. Normally it is 1.0,
+ 	 * meaning a full width bin, except for the first bin, which is only
+ 	 * counted up to the constant lower bound.
+ 	 */
+ 	prev_dist = get_distance(typcache, lower, upper);
+ 	sum_frac = 0.0;
+ 	bin_width = lower_bin_width;
+ 	for (i = lower_index; i >= 0; i--)
+ 	{
+ 		float8		dist;
+ 		double		length_hist_frac;
+ 
+ 		/*
+ 		 * dist -- distance from the upper bound of the query range to the
+ 		 * current value of the lower bound histogram, or to the lower bound
+ 		 * of the query range (if we've reached it).
+ 		 */
+ 		dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Get average fraction of length histogram which covers intervals
+ 		 * longer than (or equal to) distance to upper bound of query range.
+ 		 */
+ 		length_hist_frac =
+ 			1.0 - calc_length_hist_frac(length_hist_values,
+ 										length_hist_nvalues,
+ 										prev_dist, dist, false);
+ 
+ 		sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
+ 
+ 		bin_width = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
***************
*** 29,34 ****
--- 29,36 ----
  #include "utils/builtins.h"
  #include "utils/rangetypes.h"
  
+ static int float8_qsort_cmp(const void *a1, const void *a2);
+ static int range_bound_qsort_cmp(const void *a1, const void *a2, void *arg);
  static void compute_range_stats(VacAttrStats *stats,
  		   AnalyzeAttrFetchFunc fetchfunc, int samplerows, double totalrows);
  
***************
*** 57,62 **** range_typanalyze(PG_FUNCTION_ARGS)
--- 59,84 ----
  }
  
  /*
+  * Comparison function for float8 values, which are used to measure range
+  * lengths.
+  */
+ static int
+ float8_qsort_cmp(const void *a1, const void *a2)
+ {
+ 	const float8 *f1,
+ 			   *f2;
+ 
+ 	f1 = (const float8 *) a1;
+ 	f2 = (const float8 *) a2;
+ 	if (*f1 < *f2)
+ 		return -1;
+ 	else if (*f1 == *f2)
+ 		return 0;
+ 	else
+ 		return 1;
+ }
+ 
+ /*
   * Comparison function for sorting RangeBounds.
   */
  static int
***************
*** 77,82 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 99,105 ----
  					int samplerows, double totalrows)
  {
  	TypeCacheEntry *typcache = (TypeCacheEntry *) stats->extra_data;
+ 	bool		has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
  	int			null_cnt = 0;
  	int			non_null_cnt = 0;
  	int			non_empty_cnt = 0;
***************
*** 85,94 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 108,119 ----
  	int			slot_idx;
  	int			num_bins = stats->attr->attstattarget;
  	int			num_hist;
+ 	float8	   *lengths;
  	RangeBound *lowers, *uppers;
  	double		total_width = 0;
  
  	/* Allocate memory for arrays of range bounds. */
+ 	lengths = (float8 *) palloc(sizeof(float8) * samplerows);
  	lowers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  	uppers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  
***************
*** 101,106 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 126,132 ----
  		RangeType  *range;
  		RangeBound	lower,
  					upper;
+ 		float8		length;
  
  		vacuum_delay_point();
  
***************
*** 122,127 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 148,179 ----
  		range = DatumGetRangeType(value);
  		range_deserialize(typcache, range, &lower, &upper, &empty);
  
+ 		if (empty)
+ 		{
+ 			/* Length of empty range is zero */
+ 			length = 0.0;
+ 		}
+ 		else if (lower.infinite || upper.infinite)
+ 		{
+ 			/* Length of any kind of infinite range is infinite */
+ 			length = get_float8_infinity();
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * For an ordinary (finite, non-empty) range, use the subdiff
+ 			 * function on the upper and lower bound values, or a default
+ 			 * value of 1.0 if no subdiff function is available.
+ 			 */
+ 			if (has_subdiff)
+ 				length = DatumGetFloat8(FunctionCall2Coll(
+ 												&typcache->rng_subdiff_finfo,
+ 												typcache->rng_collation,
+ 												upper.val, lower.val));
+ 			else
+ 				length = 1.0;
+ 		}
+ 
  		if (!empty)
  		{
  			/* Fill bound values for further usage in histograms */
***************
*** 132,137 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 184,190 ----
  		else
  			empty_cnt++;
  
+ 		lengths[non_null_cnt] = length;
  		non_null_cnt++;
  	}
  
***************
*** 141,146 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 194,200 ----
  	if (non_null_cnt > 0)
  	{
  		Datum	   *bound_hist_values;
+ 		Datum	   *length_hist_values;
  		int			pos,
  					posfrac,
  					delta,
***************
*** 210,221 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			slot_idx++;
  		}
  
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 		stats->stakind[slot_idx] = STATISTIC_KIND_RANGE_EMPTY_FRAC;
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
  		slot_idx++;
  
  		MemoryContextSwitchTo(old_cxt);
--- 264,333 ----
  			slot_idx++;
  		}
  
+ 		/*
+ 		 * Generate a length histogram slot entry if we have collected at
+ 		 * least two length values (they need not be distinct).
+ 		 */
+ 		if (non_null_cnt >= 2)
+ 		{
+ 			/*
+ 			 * Sort the range lengths in ascending order before filling the
+ 			 * histogram.
+ 			 */
+ 			qsort(lengths, non_null_cnt, sizeof(float8), float8_qsort_cmp);
+ 
+ 			num_hist = non_null_cnt;
+ 			if (num_hist > num_bins)
+ 				num_hist = num_bins + 1;
+ 
+ 			length_hist_values = (Datum *) palloc(num_hist * sizeof(Datum));
+ 
+ 			/*
+ 			 * The object of this loop is to copy the first and last lengths[]
+ 			 * entries along with evenly-spaced values in between. So the i'th
+ 			 * value is lengths[(i * (nvals - 1)) / (num_hist - 1)]. But
+ 			 * computing that subscript directly risks integer overflow when the
+ 			 * stats target is more than a couple thousand.  Instead we add
+ 			 * (nvals - 1) / (num_hist - 1) to pos at each step, tracking the
+ 			 * integral and fractional parts of the sum separately.
+ 			 */
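+ 
+ 			/*
+ 			 * Illustrative numbers only: with 1000 sampled lengths and
+ 			 * num_hist = 101, the i'th entry is lengths[(i * 999) / 100],
+ 			 * i.e. lengths[0], lengths[9], lengths[19], ..., lengths[999].
+ 			 */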
+ 			delta = (non_null_cnt - 1) / (num_hist - 1);
+ 			deltafrac = (non_null_cnt - 1) % (num_hist - 1);
+ 			pos = posfrac = 0;
+ 
+ 			for (i = 0; i < num_hist; i++)
+ 			{
+ 				length_hist_values[i] = Float8GetDatum(lengths[pos]);
+ 				pos += delta;
+ 				posfrac += deltafrac;
+ 				if (posfrac >= (num_hist - 1))
+ 				{
+ 					/* fractional part exceeds 1, carry to integer part */
+ 					pos++;
+ 					posfrac -= (num_hist - 1);
+ 				}
+ 			}
+ 			stats->staop[slot_idx] = Float8LessOperator;
+ 			stats->stavalues[slot_idx] = length_hist_values;
+ 			stats->numvalues[slot_idx] = num_hist;
+ 			/* We are storing float8 values */
+ 			stats->statypid[slot_idx] = FLOAT8OID;
+ 			stats->statyplen[slot_idx] = sizeof(float8);
+ #ifdef USE_FLOAT8_BYVAL
+ 			stats->statypbyval[slot_idx] = true;
+ #else
+ 			stats->statypbyval[slot_idx] = false;
+ #endif
+ 			stats->statypalign[slot_idx] = 'd';
+ 		}
+ 
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
+ 		stats->stakind[slot_idx] = STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM;
  		slot_idx++;
  
  		MemoryContextSwitchTo(old_cxt);
*** a/src/include/catalog/pg_operator.h
--- b/src/include/catalog/pg_operator.h
***************
*** 527,532 **** DATA(insert OID = 671 (  "<>"	   PGNSP PGUID b f f	701  701	16 671 670 float8ne
--- 527,533 ----
  DESCR("not equal");
  DATA(insert OID = 672 (  "<"	   PGNSP PGUID b f f	701  701	16 674 675 float8lt scalarltsel scalarltjoinsel ));
  DESCR("less than");
+ #define Float8LessOperator	672
  DATA(insert OID = 673 (  "<="	   PGNSP PGUID b f f	701  701	16 675 674 float8le scalarltsel scalarltjoinsel ));
  DESCR("less than or equal");
  DATA(insert OID = 674 (  ">"	   PGNSP PGUID b f f	701  701	16 672 673 float8gt scalargtsel scalargtjoinsel ));
*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h
***************
*** 269,279 **** typedef FormData_pg_statistic *Form_pg_statistic;
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * An "empty frac" slot describes the fraction of empty ranges in a range-type
!  * column.  stavalues is not used and should be NULL.  stanumbers contains a
!  * single entry, the fraction of empty ranges (0.0 to 1.0).
   */
! #define STATISTIC_KIND_RANGE_EMPTY_FRAC  6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
--- 269,283 ----
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * A "length histogram" slot describes the distribution of range lengths in
!  * rows of a range-type column. stanumbers contains a single entry, the
!  * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
!  * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
!  * values that divide the column data values into M-1 bins of approximately
!  * equal population. The lengths are stored as float8s, as measured by the
!  * range type's subdiff function. Only non-null rows are considered.
   */
! #define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
#40Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#39)
Re: Statistics and selectivity estimation for ranges

On Wed, Mar 13, 2013 at 11:10 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 01.03.2013 16:22, Alexander Korotkov wrote:

On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

So, after some hacking, I ended up with this version. I find it more
readable; I hope I didn't miss anything. This seems to produce results that
are close, but not identical, to the original patch. I'm not sure where the
discrepancy is coming from, or which patch is more correct in that respect.
I'll continue from this tomorrow, but if you have the time, please take a
look and let me know what you think.

I've read your explanation and version of the patch. In general it seems
correct to me. One point where I broke up the abstraction in some functions
is infinities. For example, in calc_length_hist_frac one or both of length1
and length2 can be infinity. In the line

frac = area / (length2 - length1);

you can get a NaN result. I adjusted the code specifically to get a more or
less correct result in this case.

Hmm, good point. I think I managed to fix those cases in the attached
version. Is there any other corner case that I missed?

Did you try the test case by Jeff Davis on this thread?
/messages/by-id/1355167304.3896.37.camel@jdavis
I tried it with the attached version of the patch and got a NaN estimate.
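
To make the failure mode concrete, here is a minimal standalone C sketch
(illustration only, with made-up values; this is not code from the patch,
and the guard shown is just one possible shape of a fix):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double	length1 = 10.0;
	double	length2 = INFINITY;	/* e.g. a range with an infinite bound */
	double	area = INFINITY;	/* integral of P(x) over an infinite interval */
	double	frac = area / (length2 - length1);	/* inf / inf == NaN */

	printf("naive frac = %f\n", frac);

	/* Special-case infinite interval ends before dividing. */
	if (isinf(length2))
		frac = 1.0;		/* every finite length is below an infinite one */
	else if (length2 > length1)
		frac = area / (length2 - length1);
	else
		frac = 0.0;		/* degenerate interval */

	/* Make sure a NaN can never escape as a selectivity. */
	if (isnan(frac))
		frac = 0.5;		/* arbitrary fallback */

	printf("guarded frac = %f\n", frac);
	return 0;
}

The patch handles these cases inside calc_length_hist_frac itself rather
than with a blanket fallback, but the underlying issue is the same: any
arithmetic involving two infinities has to be special-cased before the
division.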

------
With best regards,
Alexander Korotkov.

#41Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alexander Korotkov (#40)
Re: Statistics and selectivity estimation for ranges

On 01.03.2013 16:22, Alexander Korotkov wrote:

On Wed, Mar 13, 2013 at 11:10 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 01.03.2013 16:22, Alexander Korotkov wrote:

frac = area / (length2 - length1);

you can get a NaN result. I adjusted the code specifically to get a more or
less correct result in this case.

Hmm, good point. I think I managed to fix those cases in the attached
version. Is there any other corner case that I missed?

Did you try the test case by Jeff Davis on this thread?
/messages/by-id/1355167304.3896.37.camel@jdavis
I tried it with the attached version of the patch and got a NaN estimate.

Thanks, fixed that too.

Committed with a little more cleanup and some fixes. Thank you for bearing
with this long process :-). And many thanks to Jeff for the review; sorry
that I forgot to credit you for that in the commit message.

- Heikki
